arxiv:2605.12825

Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

Published on May 12 · Submitted by Nguyen Van Chien on May 14

Abstract

Orthrus is a dual-architecture framework that combines autoregressive LLMs with diffusion models to achieve fast parallel token generation while maintaining exact inference fidelity through shared KV caches and consensus mechanisms.

AI-generated summary

We introduce Orthrus, a simple and efficient dual-architecture framework that unifies the exact generation fidelity of autoregressive Large Language Models (LLMs) with the high-speed parallel token generation of diffusion models. The sequential nature of standard autoregressive decoding represents a fundamental bottleneck for high-throughput inference. While diffusion language models attempt to break this barrier via parallel generation, they suffer from significant performance degradation, high training costs, and a lack of rigorous convergence guarantees. Orthrus resolves this dichotomy natively. Designed to seamlessly integrate into existing Transformers, the framework augments a frozen LLM with a lightweight, trainable module to create a parallel diffusion view alongside the standard autoregressive view. In this unified system, both views attend to the exact same high-fidelity Key-Value (KV) cache; the autoregressive head executes context pre-filling to construct accurate KV representations, while the diffusion head executes parallel generation. By employing an exact consensus mechanism between the two views, Orthrus guarantees lossless inference, delivering up to a 7.8x speedup with only an O(1) memory cache overhead and minimal parameter additions.
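The exact consensus mechanism described above is reminiscent of a draft-and-verify loop: the diffusion view proposes a block of tokens in parallel, and the autoregressive view checks them against its own (exact) predictions, so the final output is identical to pure autoregressive decoding. Below is a minimal toy sketch of that idea; every function here is an illustrative stand-in (integer "tokens", no real models), not the paper's implementation.

```python
import random


def ar_next_token(context):
    # Toy stand-in for the autoregressive head: a deterministic
    # next-token function of the context.
    return (sum(context) * 31 + len(context)) % 100


def diffusion_propose(context, k, error_rate, rng):
    # Toy stand-in for the diffusion head: proposes k tokens in
    # parallel; error_rate simulates occasional disagreement with
    # the autoregressive head.
    out, ctx = [], list(context)
    for _ in range(k):
        tok = ar_next_token(ctx)
        if rng.random() < error_rate:
            tok = (tok + 1) % 100  # inject a disagreement
        out.append(tok)
        ctx.append(tok)
    return out


def consensus_decode(prompt, total, block=4, error_rate=0.2, seed=0):
    # Draft-and-verify consensus loop: accept the longest prefix of
    # the parallel draft that matches the AR head exactly; on the
    # first mismatch, fall back to the AR token. The result is
    # therefore bit-identical to pure AR decoding.
    rng = random.Random(seed)
    seq = list(prompt)
    while len(seq) < len(prompt) + total:
        draft = diffusion_propose(seq, block, error_rate, rng)
        n_accept, ctx = 0, list(seq)
        for tok in draft:
            if tok != ar_next_token(ctx):
                break
            ctx.append(tok)
            n_accept += 1
        seq.extend(draft[:n_accept])
        if n_accept < block:
            seq.append(ar_next_token(seq))  # exact AR fallback
    return seq[len(prompt):][:total]
```

In this sketch each loop iteration emits between 1 and `block` tokens, which is where the speedup comes from when the two views usually agree; the verification step is what makes the output lossless.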

Community

Paper author and submitter:

Fast, lossless LLM inference via dual-view diffusion decoding.
Code: https://github.com/chiennv2000/orthrus

The amount of recent research on dLLMs has been pretty inspiring. This is another example of that.

I think diffusion solves a lot of fundamental issues with AR models, and I'm hoping it leads us to a better future for the industry.


Get this paper in your agent:

hf papers read 2605.12825
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 3

Datasets citing this paper 0


Spaces citing this paper 0


Collections including this paper 2