Self Speculative Decoding for Diffusion Large Language Models

Yifeng Gao, Ziang Ji, Yuxuan Wang, Biqing Qi, Hanlin Xu, Linfeng Zhang

TL;DR

This work tackles the efficiency bottleneck of diffusion language models by introducing Self Speculative Decoding (SSD), a lossless acceleration framework that makes the model itself act as both drafter and verifier. SSD performs self-drafting for multiple masked positions in parallel and verifies drafts through a hierarchical tree within a single forward pass, eliminating auxiliary models and reducing decoding steps significantly. Empirical results across five dLLMs and four benchmarks show speedups up to $3.46\times$ while preserving stepwise-output fidelity, with the Dream family benefiting most from caching-based acceleration. The study also analyzes the theoretical acceptance-rate limits of self-drafting and trade-offs between verification-tree size and speed, offering practical guidance for deploying SSD in real-world dLLM inference. Overall, SSD advances the practicality of diffusion-based generation by bridging parallelism and exactness, enabling faster yet faithful text generation.

Abstract

Diffusion-based Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive models, offering unique advantages through bidirectional attention and parallel generation paradigms. However, the generation results of current parallel decoding methods deviate from stepwise decoding, introducing potential performance degradation, which limits their practical deployment. To address this problem, we propose Self Speculative Decoding (SSD), a lossless inference acceleration method that leverages the dLLM itself as both speculative decoding drafter and verifier without auxiliary modules. SSD introduces a self-drafting mechanism where the model generates predictions for multiple positions, then verifies them through hierarchical verification trees in a single forward pass. Unlike traditional speculative decoding that requires separate draft models, SSD eliminates model redundancy and memory overhead by exploiting the dLLM's inherent parallel prediction capability for multiple positions. This self-speculative approach allows the model to progressively verify and accept multiple tokens in a single forward pass. Our experiments demonstrate that SSD achieves up to $3.46\times$ speedup while keeping the output identical to stepwise decoding on open-source models such as LLaDA and Dream. Code will be made publicly available on GitHub.
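
The draft-then-verify loop described in the abstract can be illustrated with a small toy, offered as a sketch under stated assumptions rather than the authors' implementation. Everything named here is hypothetical: `toy_model` is a deterministic stand-in for a dLLM forward pass, `ranked`, `stepwise_decode` and `ssd_decode` are illustrative helpers, and the verification structure is simplified to a linear chain of candidate states instead of a full hierarchical tree. The sketch counts forward passes only; it assumes that evaluating the candidates as one batch costs about as much as a single pass.

```python
MASK = -1    # placeholder id for a masked position (assumption of this toy)
VOCAB = 50   # toy vocabulary size


def toy_model(state):
    """Hypothetical stand-in for one dLLM forward pass.

    Returns {position: (token, confidence)} for every masked position.
    Predictions depend on the immediate neighbours, so filling one
    position can change what the model predicts next to it.
    """
    preds = {}
    for p, tok in enumerate(state):
        if tok != MASK:
            continue
        left = state[p - 1] if p > 0 else 0
        right = state[p + 1] if p + 1 < len(state) else 0
        ctx = (left + 1) * 31 + (right + 1) * 17 + p * 7
        preds[p] = (ctx % VOCAB, ((ctx * 2654435761) % 1000) / 1000.0)
    return preds


def ranked(preds):
    """(position, token) pairs, most confident first, ties broken by position."""
    items = sorted(preds.items(), key=lambda kv: (-kv[1][1], kv[0]))
    return [(pos, tok) for pos, (tok, _conf) in items]


def stepwise_decode(state):
    """Baseline: one forward pass per accepted token."""
    state, passes = list(state), 0
    while MASK in state:
        pos, tok = ranked(toy_model(state))[0]
        passes += 1
        state[pos] = tok
    return state, passes


def ssd_decode(state, k=3):
    """Self-drafting plus chain verification with draft length k."""
    state = list(state)
    if MASK not in state:
        return state, 0
    preds, passes = toy_model(state), 1
    while MASK in state:
        draft = ranked(preds)[:k]      # self-draft: the model's own top-k guesses
        cands, s = [], state
        for pos, tok in draft:         # candidate i has the first i+1 drafts filled in
            s = list(s)
            s[pos] = tok
            cands.append(s)
        if MASK in cands[0]:
            # Stands in for ONE batched forward pass over all candidates.
            outs = [toy_model(c) for c in cands]
            passes += 1
        else:
            outs = [{}]                # sequence is complete after the first draft
        accepted = 1                   # the first draft is the stepwise choice itself
        while (accepted < len(draft)
               and ranked(outs[accepted - 1])[:1] == [draft[accepted]]):
            accepted += 1              # next draft matches what stepwise would pick
        state = cands[accepted - 1]
        preds = outs[accepted - 1]     # reuse the verifier output as the next draft source
    return state, passes


start = [MASK] * 16
reference, stepwise_passes = stepwise_decode(start)
output, ssd_passes = ssd_decode(start, k=3)
print("identical output:", output == reference)
print("forward passes (stepwise vs. SSD):", stepwise_passes, ssd_passes)
```

A draft token is accepted only when it equals the choice stepwise decoding would make from the preceding state, so the toy reproduces the stepwise output exactly; any saving comes purely from accepting several tokens per batched pass.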

Paper Structure

This paper contains 21 sections, 6 equations, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: Tokens-per-second (TPS) comparison with and without cache, showing the memory-bound characteristics of LLaDA-8B-Instruct when cache is enabled.
  • Figure 2: Comparison of stepwise dLLM inference (bottom-left) with our SSD approach. Stepwise inference accepts one token per step following semi-autoregressive block order. SSD leverages self-drafting and hierarchical verification to accept multiple tokens per iteration. When both methods reach the same intermediate state at step $T=S$, SSD generates the next 3 tokens in step $T+1$ while stepwise requires steps $S+1$ through $S+3$ (example with draft length $k=3$).
  • Figure 3: Impact of sequence length on SSD acceleration with LLaDA-8B-Instruct. (Left) Effect of generation length (128-512 tokens) on throughput for GSM8K and MBPP. (Right) Effect of prompt length via few-shot examples (0-3 shots) on throughput for GSM8K and MBPP.
  • Figure 4: (a) Decoding may exhibit out-of-order generation. (b) Mix-order strategy supplements acceptance possibilities under out-of-order conditions.