Think First, Diffuse Fast: Improving Diffusion Language Model Reasoning via Autoregressive Plan Conditioning

Earl J St Sauver

Abstract

Diffusion large language models (dLLMs) generate text via iterative denoising but consistently underperform on multi-step reasoning. We hypothesize this gap stems from a coordination problem: AR models build coherence token-by-token, while diffusion models must coordinate all positions simultaneously. We propose plan conditioning, a training-free method that prepends a short (~100-token) natural-language plan from an AR model to the diffusion model's prompt. The plan serves as a frozen scaffold -- globally visible context that every token position can attend to from the first denoising step. On GSM8K, plan conditioning improves LLaDA-8B-Instruct from 75.6% to 87.2% (+11.6 percentage points), matching a same-size AR model (LLaMA 3.1 8B, 87.7%) despite a 6.4pp weaker baseline. On HumanEval, the gain is +12.8pp (37.2% to 50.0%), showing plans generalize to code. The same plans improve LLaMA by only +5.7pp on GSM8K and +1.3pp on HumanEval -- diffusion models benefit 2-10x more, supporting the coordination-problem hypothesis. Across 5 random seeds, plan-conditioned GSM8K accuracy has zero standard deviation, making diffusion inference highly stable. Ablations reveal the model follows plan strategy (wrong-strategy plans cause -16.3pp) but is robust to plan values (perturbed numbers: -1.1pp), and that planner quality has a sharp threshold: smaller Llama-class plans hurt (-1.6 to -6.8pp) while frontier plans provide the full lift. Attention analysis confirms the mechanism: plan tokens receive 1.8x excess attention during early denoising, declining to uniform as completion tokens solidify. Plan conditioning costs ~$0.002 per problem and adds ~2s of latency.
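The prompt-construction step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `build_plan_conditioned_prompt`, the `Plan:` / `Problem:` / `Solution:` template, and the use of whitespace splitting as a stand-in for the model tokenizer are all assumptions made for the example. The plan is assumed to have been produced beforehand by a separate AR model.

```python
def build_plan_conditioned_prompt(question: str, plan: str, max_plan_tokens: int = 100) -> str:
    """Prepend a short natural-language plan to the question.

    The returned string is meant to be used as the (fixed) prompt of a
    diffusion language model, so every completion position can attend to
    the plan from the first denoising step.
    """
    # Assumption: whitespace-separated words approximate tokens.
    # A real pipeline would count tokens with the model's own tokenizer.
    plan_words = plan.split()
    if len(plan_words) > max_plan_tokens:
        plan_words = plan_words[:max_plan_tokens]
    short_plan = " ".join(plan_words)

    # Hypothetical template; the wording used in the paper is not given here.
    return f"Plan:\n{short_plan}\n\nProblem:\n{question}\n\nSolution:"


if __name__ == "__main__":
    example_plan = "Find the cost of one item, multiply by the quantity, then subtract the discount."
    example_question = "A shirt costs $20. How much do 3 shirts cost after a $5 discount?"
    print(build_plan_conditioned_prompt(example_question, example_plan))
```

Because the plan sits in the prompt and is never re-masked, it plays the role of the frozen scaffold: only the completion positions after `Solution:` would be denoised.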

Paper Structure (70 sections, 8 figures, 12 tables)

Figures (8)

  • Figure 1: Accuracy across benchmarks and model-condition pairs. On GSM8K, plan conditioning improves LLaDA by +11.6pp (vs. +5.7pp for LLaMA with the same plans), closing the diffusion-AR gap entirely. HumanEval shows the largest diffusion advantage: LLaDA gains +12.8pp while LLaMA gains only +1.3pp. On Countdown, both architectures start at identical baselines and benefit identically (+12.1pp).
  • Figure 2: Accuracy improvement (pp) over baseline by plan format and benchmark at generation length 256. Hybrid plans (rightmost column) dominate on 3/4 benchmarks. Sudoku (bottom row) shows near-zero or negative deltas across all formats -- plans do not help spatial constraint propagation. The blue cell (Countdown-Constraints, -4.6pp) shows that constraint-only plans hurt when they lack procedural structure.
  • Figure 3: LLaDA accuracy vs. plan token budget (hybrid format). Left: GSM8K shows smooth saturation with diminishing returns past 100 tokens. Right: Countdown exhibits a sharp performance threshold -- plans hurt at 25-50 tokens, then accuracy jumps +14.5pp from 50 to 100 tokens. Dashed line marks the no-plan baseline.
  • Figure 4: GSM8K accuracy by planner capability. There is a sharp quality cliff between Haiku (helpful, +6.7pp) and Llama 8B (marginally harmful, -1.6pp). Self-plans (LLaDA planning for itself) are indistinguishable from no plan.
  • Figure 5: Plan quality spectrum on GSM8K. Wrong-strategy plans cause catastrophic failure (-16.3pp), while perturbed-numbers plans are nearly neutral (-1.1pp). The model follows the plan's reasoning approach but computes its own values.
  • ...and 3 more figures
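
The "2-10x" relative benefit quoted in the abstract follows directly from the per-benchmark gains reported there and in the Figure 1 caption. The short check below reproduces that arithmetic; the helper name `gain_ratio` and the dictionary layout are illustrative choices, and only the four gain values are taken from the text.

```python
def gain_ratio(diffusion_gain_pp: float, ar_gain_pp: float) -> float:
    """Ratio of the diffusion model's gain to the AR model's gain (both in pp)."""
    return diffusion_gain_pp / ar_gain_pp


# (LLaDA gain, LLaMA gain) in percentage points, as quoted in the text.
gains_pp = {
    "GSM8K": (11.6, 5.7),
    "HumanEval": (12.8, 1.3),
}

ratios = {name: round(gain_ratio(d, a), 1) for name, (d, a) in gains_pp.items()}

if __name__ == "__main__":
    for name, ratio in ratios.items():
        print(f"{name}: diffusion gain is about {ratio}x the AR gain")
```

The GSM8K ratio comes out near 2 and the HumanEval ratio near 10, which matches the quoted range. Countdown, where both models gain +12.1pp, would give a ratio of 1 and so falls outside it.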