DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding

Guanghao Li, Zhihui Fu, Min Fang, Qibin Zhao, Ming Tang, Chun Yuan, Jun Wang

TL;DR

This paper addresses the latency bottleneck of autoregressive decoding with DiffuSpec, a training-free speculative decoding framework that uses a pretrained diffusion language model as the drafter. It handles two diffusion-specific challenges (bidirectional drafting and a fixed draft length) with two components: causal-consistency path search (CPS), which selects a causally aligned left-to-right path within a token lattice, and an adaptive draft-length (ADL) controller, which sets subsequent draft lengths based on verifier feedback. Empirically, DiffuSpec achieves up to 3× wall-clock speedup on Spec-Bench, outperforming training-free baselines and closely approaching training-based methods under quality-locked settings. The approach requires no additional training or architecture changes to the target model, making it a practical drop-in that expands the viability of diffusion-based drafting for speculative decoding.

Abstract

As large language models (LLMs) scale up, accuracy improves, but the autoregressive (AR) nature of decoding increases latency since each token requires a serial forward pass. Speculative decoding addresses this by employing a fast drafter to propose multi-token drafts, which are then verified in parallel by the target model. However, many deployments still rely on AR drafters, where sequential passes limit wall-clock gains. We revisit the drafting stage and present DiffuSpec, a training-free drop-in framework that uses a pretrained diffusion language model (DLM) to produce multi-token drafts in a single forward pass, while remaining compatible with standard AR verifiers. Because DLM drafts are generated under bidirectional conditioning, parallel per-position candidates form a token lattice in which the locally highest-probability token at each position need not form a causal left-to-right path. Moreover, DLM drafting requires pre-specifying a draft length, inducing a speed-quality trade-off. To address these challenges, we introduce two practical components: (i) a causal-consistency path search (CPS) over this lattice that extracts a left-to-right path aligned with AR verification; and (ii) an adaptive draft-length (ADL) controller that adjusts the next proposal size based on recent acceptance feedback and realized generated length. Across benchmarks, DiffuSpec yields up to 3× wall-clock speedup, establishing diffusion-based drafting as a robust alternative to autoregressive drafters for speculative decoding.
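
The draft-then-verify step and the role of the length controller can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes greedy verification, where the verifier accepts the longest draft prefix that matches the target model's own greedy tokens. The abstract says only that ADL uses recent acceptance feedback and the realized generated length, so the update rule in `next_draft_length` and its parameters (`k_min`, `k_max`, `margin`) are hypothetical choices made for illustration.

```python
from typing import Sequence


def accepted_prefix_length(draft: Sequence[int], target_tokens: Sequence[int]) -> int:
    """Count the leading draft tokens that match the target's greedy tokens."""
    n = 0
    for d, t in zip(draft, target_tokens):
        if d != t:
            break
        n += 1
    return n


def next_draft_length(
    gen_len: int,
    acc_len: int,
    k_min: int = 4,
    k_max: int = 32,
    margin: int = 2,
) -> int:
    """Hypothetical controller (an assumption, not the paper's rule).

    gen_len: draft length before its first EOS (realized generated length).
    acc_len: number of draft tokens the verifier accepted.
    Proposes the midpoint of the two signals plus a small margin,
    clamped to [k_min, k_max].
    """
    proposal = (gen_len + acc_len) // 2 + margin
    return max(k_min, min(k_max, proposal))


# One verification step: the draft diverges from the target at index 2.
draft = [11, 12, 13, 14]
target = [11, 12, 99, 14]
acc = accepted_prefix_length(draft, target)
k_next = next_draft_length(gen_len=len(draft), acc_len=acc)
```

In this sketch a low acceptance count pulls the next draft length down, while a long, fully accepted draft pushes it up toward `k_max`; any real controller would need to be tuned against measured verifier cost.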

Paper Structure

This paper contains 19 sections, 11 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: Speculative decoding: AR vs. DiffuSpec. (a) AR drafter: drafts are produced sequentially and then block-verified by the target AR model. (b) DiffuSpec (DLM drafter): a single forward pass proposes a block for one-shot parallel verification; within DiffuSpec, causal-consistency path search (CPS) selects a left-to-right path from the diffusion token lattice, and the adaptive draft-length (ADL) controller sets the next draft length by selecting how many masked positions to fill.
  • Figure 2: DLM token-mass diffusion (Dream-7B). Probability mass spreads across positions during joint block refinement; the per-position top-1 need not yield an AR-consistent left-to-right path under $p_\theta$.
  • Figure 3: Pruned candidate lattice and CPS. We keep tokens via a cumulative-mass threshold $\tau$ (e.g., $0.8$), always retain $\mathrm{EOS}$, early-stop after the first $\mathrm{EOS}$, and select the best path using a DLM score plus a causal ($n$-gram) proxy.
  • Figure 4: Qualitative effect of draft length. As the draft length $k_t$ increases, DLM proposals evolve from short fragments to more complete answers; once the model deems the content "complete," an early $\mathrm{EOS}$ truncates further content.
  • Figure 5: Adaptive-length signals vs. draft length. For each $k_t$, we plot the mean and $\pm$1 standard deviation of the $\mathrm{EOS}$-aware generation length $L^{\mathrm{gen}}$ and the accepted length $L^{\mathrm{acc}}$ across evaluation prompts. The dashed diagonal $y{=}x$ marks the ideal should-generate line.
  • ...and 2 more figures
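
The lattice pruning and path selection described in the Figure 3 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' algorithm: the EOS token id, the bigram callback `bigram_logprob`, the weight `lam`, the `bos` sentinel, and the choice to truncate the selected path after its first EOS are all hypothetical. The search is a plain Viterbi pass, which is exact here only because the assumed causal proxy is a bigram that depends on the previous token alone.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = 0  # hypothetical EOS token id

Lattice = List[List[Tuple[int, float]]]


def prune_position(probs: Dict[int, float], tau: float = 0.8) -> List[Tuple[int, float]]:
    """Keep top tokens until their cumulative mass reaches tau; always keep EOS."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept: List[Tuple[int, float]] = []
    mass = 0.0
    for tok, p in ranked:
        if mass >= tau:
            break
        kept.append((tok, p))
        mass += p
    if EOS in probs and all(tok != EOS for tok, _ in kept):
        kept.append((EOS, probs[EOS]))
    return kept


def best_path(
    lattice: Lattice,
    bigram_logprob: Callable[[int, int], float],
    lam: float = 0.5,
    bos: int = -1,
) -> List[int]:
    """Viterbi search: DLM log-probability plus lam times a bigram proxy.

    Assumes every retained probability is strictly positive.
    """
    # best[tok] = (score, path) for the best partial path ending in tok
    best: Dict[int, Tuple[float, List[int]]] = {bos: (0.0, [])}
    for candidates in lattice:
        nxt: Dict[int, Tuple[float, List[int]]] = {}
        for tok, p in candidates:
            for prev, (score, path) in best.items():
                s = score + math.log(p) + lam * bigram_logprob(prev, tok)
                if tok not in nxt or s > nxt[tok][0]:
                    nxt[tok] = (s, path + [tok])
        best = nxt
    _, path = max(best.values(), key=lambda sp: sp[0])
    if EOS in path:
        # Assumed reading of "early-stop": drop everything after the first EOS.
        path = path[: path.index(EOS) + 1]
    return path


def toy_bigram(prev: int, tok: int) -> float:
    """Toy causal proxy that strongly penalizes the pair (1, 3)."""
    return -5.0 if (prev, tok) == (1, 3) else -0.1


toy_lattice: Lattice = [[(1, 0.6), (2, 0.4)], [(3, 0.55), (4, 0.45)]]
chosen = best_path(toy_lattice, toy_bigram, lam=1.0)
```

In the toy lattice, the per-position top-1 tokens are 1 and then 3, but the proxy penalizes that pair, so the search returns a different path. This mirrors the point made in the Figure 2 caption that the per-position top-1 need not form an AR-consistent sequence.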