S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

Ligong Han, Hao Wang, Han Gao, Kai Xu, Akash Srivastava

Abstract

Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test-time compute. We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models. Our key observation is that a block-diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence-level critic. Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR, we observe up to $4.7\times$ speedup over autoregressive decoding, and up to $1.57\times$ over a tuned dynamic decoding baseline while improving accuracy by up to $4.5$ points. On LLaDA2.1-Mini, S2D2 remains complementary to built-in self-correction, including a conservative setting where it is $4.4\times$ faster than the static baseline with slightly higher accuracy.
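The acceptance step described above follows the standard speculative-decoding recipe: each drafted token is kept with probability $\min(1, p_{\text{verify}}/p_{\text{draft}})$, and the first rejection ends the accepted prefix, with the remaining positions falling back to diffusion decoding. The sketch below illustrates only this acceptance rule with placeholder probabilities; the function name and inputs are illustrative, not the paper's actual implementation.

```python
import random


def verify_draft(draft_tokens, p_draft, p_verify, rng=None):
    """Toy sketch of speculative acceptance (not the paper's code).

    draft_tokens: tokens proposed by the diffusion (drafter) pass.
    p_draft[i]:   drafter probability of draft_tokens[i].
    p_verify[i]:  verifier (block-size-1 autoregressive) probability.

    Each token is accepted with probability min(1, p_verify / p_draft);
    the first rejection terminates verification, and later positions
    would be re-decoded by the standard diffusion fallback.
    """
    rng = rng or random.Random(0)
    accepted = []
    for tok, q, p in zip(draft_tokens, p_draft, p_verify):
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)
        else:
            break  # reject: stop and fall back to diffusion decoding
    return accepted
```

When the verifier agrees with the drafter (`p_verify >= p_draft`), every token is kept; when it assigns a token zero probability, the accepted prefix ends there.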

Paper Structure

This paper contains 52 sections, 18 equations, 6 figures, 11 tables, and 4 algorithms.

Figures (6)

  • Figure 1: Overview of S2D2. (a) Standard block-diffusion decoding accepts drafted tokens by confidence thresholding. (b) S2D2 inserts a self-speculative verification step: the same model under block-size-$1$ autoregressive masking verifies the first contiguous masked span, accepts tokens by rejection sampling, and falls back to standard diffusion decoding when verification is not invoked or terminates early. (c) Verification-mode attention masks for right-shifted and standard position-aligned diffusion LLMs; we draw the full-block mask for illustration, though in practice only $C_t$ is verified. (d) Lightweight routing policies decide when verification is worth its additional cost.
  • Figure 2: AR-ness (@$k$, $k=2$) and decoding-confidence statistics on GSM8K and MBPP. Top row: local and global AR-ness for SDAR-8B-Chat and LLaDA 2.1. Bottom row: normalized decoded-token confidence under static and dynamic diffusion decoding; dashed curves in dynamic decoding indicate the number of decoded tokens per step. Reference accuracies (GSM8K, MBPP): SDAR-8B-Chat, AR $(89.3\%, 64.4\%)$ and diffusion $(89.6\%, 61.0\%)$; LLaDA 2.1, AR $(90.8\%, 65.8\%)$ and diffusion $(90.8\%, 67.8\%)$.
  • Figure 3: Accuracy versus wall-clock time for SDAR-8B-Chat on GSM8K and MBPP. ITS denotes inference-time scaling; the ITS trend curve is fitted on points with accuracy $>30\%$. Across block sizes, denoising steps, and decoding schedules, S2D2 generally achieves a better accuracy-speed frontier than BD3.
  • Figure 4: Drafting and caching with autoregressive attention masks.
  • Figure 5: AR-ness and decoding confidence statistics for SDAR-8B-Chat. Top row: local/global AR-ness on GSM8K and MBPP. Bottom row: normalized confidence statistics under static and dynamic decoding on GSM8K and MBPP.
  • ...and 1 more figure

Theorems & Definitions (1)

  • Remark 1: Local energy-guided interpretation