TiDAR: Think in Diffusion, Talk in Autoregression

Jingyu Liu, Xin Dong, Zhifan Ye, Rishabh Mehta, Yonggan Fu, Vartika Singh, Jan Kautz, Ce Zhang, Pavlo Molchanov

TL;DR

The paper asks how to get both high throughput and AR-level quality, given that diffusion language models typically trade quality for speed. TiDAR is a sequence-level hybrid architecture that drafts tokens with diffusion for parallelism and samples the final outputs autoregressively, both within a single forward pass that uses specially designed structured attention masks. It trains a dual-mode backbone with joint AR and diffusion losses, combined in a balanced $L_{TiDAR}$ objective, and uses one-step diffusion at inference. This gives fully parallel drafting with AR-style verification and exact KV caching. At the 1.5B and 8B scales, TiDAR delivers 4.71x to 5.91x the throughput of AR baselines at competitive quality, and it outperforms both speculative decoding and Block Diffusion.
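As a reading aid only, one way a "balanced" joint objective could be written is a normalized weighted sum of the two losses. This summary does not give the exact form, so the weight $\alpha$ and the normalization below are assumptions, not the paper's stated definition:

$$L_{TiDAR} = \frac{\alpha \, L_{AR} + L_{Diff}}{1 + \alpha}$$

Here $L_{AR}$ stands for the next-token cross-entropy on the causally encoded clean tokens and $L_{Diff}$ for the cross-entropy on the mask-token positions. Under this assumed form, $\alpha = 1$ weights the two modes equally, and dividing by $1 + \alpha$ keeps the total on the same scale as a single loss.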

Abstract

Diffusion language models hold the promise of fast parallel generation, while autoregressive (AR) models typically excel in quality because their causal structure aligns naturally with language modeling. This raises a fundamental question: can we achieve a synergy with high throughput, higher GPU utilization, and AR-level quality? Existing methods fail to balance these two aspects effectively. They either prioritize AR by using a weaker model for sequential drafting (speculative decoding), which lowers drafting efficiency, or they use some form of left-to-right (AR-like) decoding logic for diffusion, which still suffers from quality degradation and forfeits its potential parallelizability. We introduce TiDAR, a sequence-level hybrid architecture that drafts tokens (Thinking) in Diffusion and samples final outputs (Talking) AutoRegressively, all within a single forward pass using specially designed structured attention masks. This design exploits the free GPU compute density, achieving a strong balance between drafting and verification capacity. Moreover, TiDAR is designed to be serving-friendly (low overhead) as a standalone model. We extensively evaluate TiDAR against AR models, speculative decoding, and diffusion variants across generative and likelihood tasks at 1.5B and 8B scales. Thanks to parallel drafting and sampling as well as exact KV cache support, TiDAR outperforms speculative decoding in measured throughput and surpasses diffusion models such as Dream and LLaDA in both efficiency and quality. Most notably, TiDAR is the first architecture to close the quality gap with AR models while delivering 4.71x to 5.91x more tokens per second.
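The draft-then-verify idea in the abstract can be illustrated with a minimal sketch. It uses the standard speculative-style rule (accept the longest prefix of drafted tokens that agrees with the AR samples, then emit the AR token at the first disagreement) as a stand-in; the paper's actual sampling procedure may differ. The function names `accept_prefix` and `decode_step`, and the plain lists of token ids, are hypothetical.

```python
def accept_prefix(draft_tokens, ar_tokens):
    """Count how many leading drafted tokens agree with the AR samples.

    Assumption: ar_tokens[i] is the token the AR mode samples at draft
    position i, conditioned on the prefix plus draft_tokens[:i].
    """
    accepted = 0
    for drafted, sampled in zip(draft_tokens, ar_tokens):
        if drafted != sampled:
            break
        accepted += 1
    return accepted


def decode_step(prefix, draft_tokens, ar_tokens):
    """Extend the prefix by the accepted drafts plus one AR correction token.

    Returns the new prefix and the accepted length. In a real system the
    KV-cache entries of rejected draft positions would be evicted here.
    """
    n = accept_prefix(draft_tokens, ar_tokens)
    new_tokens = list(draft_tokens[:n])
    if n < len(ar_tokens):
        # The AR sample at the first mismatch is still valid, because it
        # was conditioned only on tokens that were accepted.
        new_tokens.append(ar_tokens[n])
    return list(prefix) + new_tokens, n


if __name__ == "__main__":
    # Draft length 3, accepted length 2.
    new_prefix, accepted = decode_step([1], [5, 7, 9], [5, 7, 2])
    print(new_prefix, accepted)
```

Because the drafts for the next step are produced in the same forward pass as the verification, a step like this costs one model call regardless of how many drafts are accepted.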

Paper Structure

This paper contains 29 sections, 4 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Latency Scaling over Token Slots: We plot the latency of Qwen3-32B decoding on NVIDIA H100 with batch size = 1 and Flash Attention 2 (Dao, 2023) over different prefix lengths. Latency stays roughly constant up to a certain number of tokens sent through the forward pass (free + cheap token slots), before transitioning to the compute-bound regime. We leverage this characteristic to achieve almost free parallel drafting and sampling for TiDAR.
  • Figure 2: TiDAR Architecture: TiDAR uses a single model forward to sample drafted tokens from the last step and pre-draft tokens for the next step in parallel. By switching the attention pattern among different parts of the sequence, TiDAR encodes the clean tokens drafted from the last step causally and the mask tokens block-causally (bidirectional within each block) for one-step diffusion pre-drafting. Upon accepting a prefix, the corresponding pre-drafts (proposal) can be selected. The KV cache for tokens forwarded causally will be stored and later evicted if the corresponding tokens are rejected. We illustrate this with a draft length of 3 and an accepted length of 2. Figure 3 shows the exact decoding mask for this example.
  • Figure 3: TiDAR Attention Masks: (Left) We apply a special training mask (using block length = 3): mask tokens of the same length are appended to the input tokens where the clean input tokens are self-attended causally and mask tokens within-block bidirectionally along with the prefix. During inference parallel decoding, we use a slice of a pre-initialized mask based on the prefix of the current step (Right). To reuse the mask, we reorder the sampling-draft part (tokens drafted from last step and mask positions for next step pre-drafting) and the clean prefix as illustrated with an example prefix length of 3.
  • Figure 4: Efficiency-Quality Benchmarking: We compare TiDAR at 1.5B and 8B with AR, AR with speculative decoding (EAGLE-3), and Block Diffusion. Points of the same color share a model size, and marker shapes distinguish methods. The y-axis shows individual task scores. The x-axis shows the relative decoding throughput speedup measured in tokens per second, with the baseline being the AR model within the same size group (Qwen2.5 1.5B Base, Qwen3 8B Base and Qwen3 8B Instruct). Above each point, we report the average tokens per NFE. For 1.5B models, we show two settings for Block Diffusion (threshold = max, 0.8, illustrated from left to right) and three settings for TiDAR (training block size = 4, 8, 16, illustrated from left to right).
  • Figure 5: Pareto Frontier of Different Architectures with the Same Recipe: We report the performance-efficiency trade-offs on 1.5B scale among AR model, fine-tuned AR model, Block Diffusion under different decoding thresholds, and TiDAR using different drafting lengths. Our model achieves the best Pareto Frontier compared to Block Diffusion and AR and is approaching the quality of fine-tuned AR with 7x more tokens per NFE.
  • ...and 2 more figures
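
The training mask described in the Figure 3 caption can be sketched as a boolean matrix. This is a minimal illustration in plain Python, not the authors' implementation. It assumes that the sequence is laid out as `seq_len` clean tokens followed by `seq_len` mask tokens (one per clean position), and that "the prefix" for a mask block means the clean tokens of all earlier blocks. The function name `tidar_training_mask` is hypothetical.

```python
def tidar_training_mask(seq_len, block_len):
    """Build allowed[q][k]: True if query position q may attend to key k.

    Assumed layout: indices 0..seq_len-1 are clean tokens, and indices
    seq_len..2*seq_len-1 are mask tokens aligned with the same positions.
    """
    total = 2 * seq_len
    allowed = [[False] * total for _ in range(total)]

    # Clean tokens attend causally to clean tokens only.
    for q in range(seq_len):
        for k in range(q + 1):
            allowed[q][k] = True

    # Mask tokens attend to the clean prefix before their block and
    # bidirectionally to the mask tokens inside their own block.
    for pos in range(seq_len):
        block_start = (pos // block_len) * block_len
        block_end = min(block_start + block_len, seq_len)
        row = allowed[seq_len + pos]
        for k in range(block_start):
            row[k] = True
        for k in range(block_start, block_end):
            row[seq_len + k] = True

    return allowed


if __name__ == "__main__":
    # Block length 3, as in the Figure 3 caption; 6 clean + 6 mask tokens.
    for row in tidar_training_mask(seq_len=6, block_len=3):
        print("".join("x" if cell else "." for cell in row))
```

A matrix of this kind can be passed as a boolean attention mask to an attention kernel that supports arbitrary masks. At inference time, the caption describes taking a slice of a pre-initialized mask instead of rebuilding it at every step.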