DPad: Efficient Diffusion Language Models with Suffix Dropout
Xinhua Chen, Sitao Huang, Cong Guo, Chiyue Wei, Yintao He, Jianyi Zhang, Hai "Helen" Li, Yiran Chen
TL;DR
DPad addresses the high computational cost that diffusion-based LLMs incur from full suffix attention. It is a training-free mechanism that restricts attention to nearby suffix tokens through a sliding window and distance-decay dropout. The Scratchpad view treats suffix tokens as a cross-layer memory that is written, stored, and then read by subsequent layers, enabling efficient yet effective denoising. The Diffusion Lottery Tickets hypothesis provides a theoretical lens: a sparse, Gaussian-guided subset of suffix tokens suffices to preserve accuracy, so that selecting this subset acts as a training-free lottery ticket search. Empirically, DPad delivers up to 61.39x speedup when combined with parallel decoding and prefix caching, across LLaDA and Dream models on long sequences, while maintaining or improving accuracy on many tasks. This positions DPad as a practical, scalable component for long-sequence inference in diffusion-based language modeling.
Abstract
Diffusion-based Large Language Models (dLLMs) parallelize text generation by framing decoding as a denoising process, but suffer from high computational overhead since they predict all future suffix tokens at each step while retaining only a small fraction. We propose Diffusion Scratchpad (DPad), a training-free method that restricts attention to a small set of nearby suffix tokens, preserving fidelity while eliminating redundancy. DPad integrates two strategies: (i) a sliding window, which maintains a fixed-length suffix window, and (ii) distance-decay dropout, which deterministically removes distant suffix tokens before attention computation. This simple design is compatible with existing optimizations such as prefix caching and can be implemented with only a few lines of code. Comprehensive evaluations across multiple benchmarks on LLaDA-1.5 and Dream models demonstrate that DPad delivers up to $\mathbf{61.4\times}$ speedup over vanilla dLLMs while maintaining comparable accuracy, highlighting its potential for efficient and scalable long-sequence inference. Our code is available at https://github.com/Crys-Chen/DPad.
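To make the two strategies concrete, the following is a minimal sketch of how a sliding window and distance-decay dropout could select the suffix positions that take part in attention. It is an illustration under stated assumptions, not the authors' implementation: the function names `select_suffix_indices` and `attended_positions`, the parameters `window`, `sigma`, and `seed`, and the Gaussian-shaped keep probability are all hypothetical choices, and a fixed random seed is used here as one possible way to make the selection reproducible.

```python
import math
import random


def select_suffix_indices(suffix_len, window, sigma, seed=0):
    """Return the suffix offsets to keep (0 = first token after the current block).

    Assumptions (not taken from the paper's code):
      * offsets >= window are always dropped (sliding window);
      * inside the window, an offset d is kept with probability
        exp(-d^2 / (2 * sigma^2)), i.e. a Gaussian-shaped distance decay;
      * a fixed seed makes the selection reproducible across calls.
    """
    rng = random.Random(seed)
    kept = []
    for d in range(min(suffix_len, window)):
        keep_prob = math.exp(-(d * d) / (2.0 * sigma * sigma))
        if rng.random() < keep_prob:
            kept.append(d)
    return kept


def attended_positions(prefix_len, block_len, total_len, window, sigma, seed=0):
    """Absolute positions passed to attention: prefix, current block, kept suffix."""
    suffix_start = prefix_len + block_len
    suffix_len = max(total_len - suffix_start, 0)
    kept = select_suffix_indices(suffix_len, window, sigma, seed)
    # Prefix and current-block tokens are always retained; only the suffix is pruned.
    return list(range(suffix_start)) + [suffix_start + d for d in kept]


if __name__ == "__main__":
    positions = attended_positions(
        prefix_len=10, block_len=4, total_len=1024, window=64, sigma=16.0
    )
    print(f"attending to {len(positions)} of 1024 positions")
```

Because the pruning happens before attention is computed, a sketch like this only changes which positions are gathered; the attention kernel itself is untouched, which is consistent with the claim that the method needs only a few lines of code and composes with prefix caching.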
