DPad: Efficient Diffusion Language Models with Suffix Dropout

Xinhua Chen, Sitao Huang, Cong Guo, Chiyue Wei, Yintao He, Jianyi Zhang, Hai "Helen" Li, Yiran Chen

TL;DR

DPad reduces the high computational cost that full suffix attention imposes on diffusion-based LLMs. It is a training-free mechanism that restricts attention to nearby suffix tokens via a sliding window and distance-decay dropout. The Scratchpad view treats suffix tokens as a cross-layer memory: information is written to them in one layer, stored, and read back in subsequent layers, which keeps denoising both efficient and effective. The Diffusion Lottery Tickets hypothesis provides a theoretical lens: a sparse, Gaussian-guided subset of suffix tokens suffices to preserve accuracy, so selecting that subset acts as a training-free lottery ticket search. Empirically, DPad delivers speedups of up to 61.39× when combined with parallel decoding and prefix caching, across LLaDA and Dream models on long sequences, while maintaining or improving accuracy on many tasks. This positions DPad as a practical, scalable component for long-sequence inference in diffusion-based language modeling.

Abstract

Diffusion-based Large Language Models (dLLMs) parallelize text generation by framing decoding as a denoising process, but suffer from high computational overhead since they predict all future suffix tokens at each step while retaining only a small fraction. We propose Diffusion Scratchpad (DPad), a training-free method that restricts attention to a small set of nearby suffix tokens, preserving fidelity while eliminating redundancy. DPad integrates two strategies: (i) a sliding window, which maintains a fixed-length suffix window, and (ii) distance-decay dropout, which deterministically removes distant suffix tokens before attention computation. This simple design is compatible with existing optimizations such as prefix caching and can be implemented with only a few lines of code. Comprehensive evaluations across multiple benchmarks on LLaDA-1.5 and Dream models demonstrate that DPad delivers up to $\mathbf{61.4\times}$ speedup over vanilla dLLMs while maintaining comparable accuracy, highlighting its potential for efficient and scalable long-sequence inference. Our code is available at https://github.com/Crys-Chen/DPad.
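The two strategies named in the abstract can be illustrated with a small sketch. It is not the code from the DPad repository: the function names, the `window`, `sigma`, and `seed` parameters, the Gaussian-shaped keep probability, and the use of a fixed seed to make the selection repeatable are all assumptions chosen for illustration. The sketch only selects which key positions the current block would attend to; the selection happens before any attention scores are computed.

```python
import math
import random


def select_suffix_offsets(suffix_len, window=64, sigma=16.0, seed=0):
    """Pick which suffix offsets to keep (offset 0 = first token after the current block).

    Hypothetical helper: `window`, `sigma`, and `seed` are illustrative names,
    not parameters taken from the DPad code base.
    """
    rng = random.Random(seed)  # fixed seed: the same offsets are chosen on every call
    kept = []
    # Sliding window: offsets at or beyond `window` are never considered.
    for d in range(min(suffix_len, window)):
        # Distance-decay dropout: keep probability shrinks with distance d
        # (Gaussian-shaped curve, assumed here for illustration).
        keep_prob = math.exp(-(d * d) / (2.0 * sigma * sigma))
        if rng.random() < keep_prob:
            kept.append(d)
    return kept


def attended_key_indices(prefix_len, block_len, total_len,
                         window=64, sigma=16.0, seed=0):
    """Absolute key positions visible to the current block: the whole prefix,
    the current block itself, and the surviving suffix tokens."""
    suffix_start = prefix_len + block_len
    suffix_len = max(0, total_len - suffix_start)
    kept = select_suffix_offsets(suffix_len, window, sigma, seed)
    return list(range(suffix_start)) + [suffix_start + d for d in kept]


if __name__ == "__main__":
    keys = attended_key_indices(prefix_len=100, block_len=32, total_len=512)
    print(len(keys), "of 512 positions are attended to")
```

In this sketch the number of attended suffix tokens is bounded by `window` regardless of the total sequence length, which is the property that makes the suffix cost independent of how long the generation is.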

Paper Structure

This paper contains 36 sections, 13 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Comparison of (a) autoregressive LLMs, (b) block-wise diffusion LLMs, and (c) our DPad. DPad restricts suffix attention via (i) Sliding Window: a fixed-length suffix window; and (ii) Distance-decay Dropout: removal of distant suffix tokens without computing their attention scores.
  • Figure 2: Attention score maps illustrating the Scratchpad mechanism in dLLMs. The maps were generated by the LLaDA-1.5 model (Zhu et al., 2025) on prompts and 512-token sequences from the GSM8K dataset (Cobbe et al., 2021). The attention matrix is divided into $3\times 3$ blocks over prefix, current, and suffix. Blocks 7 and 8 collect information from the prefix and current blocks into the suffix at layer $n$, while Block 6 feeds this stored information back to the current block at layer $(n\!+\!1)$. This write–store–read cycle makes suffix tokens a dynamic scratchpad for cross-layer contextual reuse.
  • Figure 3: Analysis of the Suffix Drop strategy at the final layer (Layer 31), generated by the LLaDA-1.5 model (Zhu et al., 2025) on the GSM8K dataset (Cobbe et al., 2021) with a maximum length of 512. We collect attention weights from 100 samples across all heads, focusing on the current block queries ($A[c{:}s,\ c{-}200{:}]$). To align positions, we truncate the prefix to the $200$ tokens before $c$ and align all current blocks at $c$. The plot shows the mean attention distribution over key indices (green curve), together with the min–max range (shaded area). Current and suffix boundaries are marked, showing that attention on far suffix tokens decays rapidly, which motivates our distance-decay dropout design.
  • Figure 4: (a) Average, maximum, and minimum attention scores that current block tokens assign to suffix tokens ($A[c{:}s,\ s{:}]$) across layers in LLaDA-1.5, showing overall decay with occasional spikes (e.g., $d = 199, 298, 362$). (b) After these spike positions are forcibly pruned, attention shifts to nearby tokens, indicating that adjacent positions can absorb suffix information (e.g., pruning token 362 shifts the spike to token 359).
  • Figure 5: A Case Study from GSM8K on In-Context Learning and Format Adherence. The figure contrasts a baseline model's output with the same model enhanced by DPad. The baseline produces the correct answer (passing Flexible-Match) but fails to replicate the structured reasoning from the prompt, thus failing the Strict-Match. DPad successfully generates both the correct answer and the required format, passing both evaluations.
  • ...and 5 more figures
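
The $3\times 3$ partition described in the Figure 2 caption can be made concrete with a short sketch. The numbering below is inferred from the caption rather than stated in it: rows are taken to be query regions, columns key regions, and blocks are numbered 1 to 9 in row-major order, which is the reading under which Blocks 7 and 8 are suffix queries attending to prefix and current keys, and Block 6 is current queries attending to suffix keys. The helper names `region` and `block_id` are hypothetical.

```python
def region(i, c, s):
    """Region of position i: 0 = prefix [0, c), 1 = current [c, s), 2 = suffix [s, ...)."""
    if i < c:
        return 0
    if i < s:
        return 1
    return 2


def block_id(q, k, c, s):
    """Block number (1..9) of attention entry A[q, k].

    Assumes rows index queries, columns index keys, and row-major numbering;
    this convention is inferred from the Figure 2 caption.
    """
    return 3 * region(q, c, s) + region(k, c, s) + 1


if __name__ == "__main__":
    c, s = 100, 132  # illustrative boundaries: current block spans [100, 132)
    print(block_id(200, 50, c, s))   # suffix query, prefix key
    print(block_id(200, 110, c, s))  # suffix query, current key
    print(block_id(110, 200, c, s))  # current query, suffix key
```

Under this convention, the write step at layer $n$ corresponds to entries in Blocks 7 and 8, and the read step at layer $n+1$ corresponds to entries in Block 6.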