Stopping Computation for Converged Tokens in Masked Diffusion-LM Decoding
Daisuke Oba, Danushka Bollegala, Masahiro Kaneko, Naoaki Okazaki
TL;DR
MDLM decoding reprocesses every token at every step, so attention costs $O(N^2 d)$ per step. This work introduces SureLock, a convergence-based locking scheme that permanently freezes unmasked tokens whose posteriors have stabilized and caches their $K/V$ so that the remaining tokens can still attend to them. This reduces the per-step cost to $O(M N d)$, where $M$ is the number of unlocked positions, and yields a monotonically decreasing compute profile as sampling proceeds. A theoretical analysis shows that the local KL divergence at the lock step bounds the deviation in final token probabilities, which gives a principled justification for the locking rule. Empirically, SureLock reduces algorithmic FLOPs by 30–50% on LLaDA-8B-Instruct with comparable generation quality, and it is complementary to orthogonal acceleration methods, making it a practical option for longer-context diffusion decoding.
Abstract
Masked Diffusion Language Models generate sequences via iterative sampling that progressively unmasks tokens. However, they still recompute the attention and feed-forward blocks for every token position at every step, even when many unmasked tokens are essentially fixed, which wastes substantial compute. We propose SureLock: when the posterior at an unmasked position has stabilized across steps (our sure condition), we lock that position, thereafter skipping its query projection and feed-forward sublayers while caching its attention keys and values so that other positions can continue to attend to it. This reduces the dominant per-iteration computational cost from $O(N^2 d)$ to $O(M N d)$, where $N$ is the sequence length, $M$ is the number of unlocked token positions, and $d$ is the model dimension. In practice, $M$ decreases as the iteration progresses, yielding substantial savings. On LLaDA-8B, SureLock reduces algorithmic FLOPs by 30--50% relative to the same sampler without locking, while maintaining comparable generation quality. We also provide a theoretical analysis to justify the design rationale of SureLock: monitoring only the local KL at the lock step suffices to bound the deviation in final token probabilities. Our code will be available at https://daioba.github.io/surelock.
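As a rough illustration of the locking rule described above, the following Python sketch locks an unmasked position once the KL divergence between its posteriors at two consecutive steps falls below a threshold. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the KL direction, the consecutive-step comparison, and the threshold value `1e-3` are all hypothetical choices made for illustration. The last helper only restates the cost ratio implied by $O(M N d)$ versus $O(N^2 d)$.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def update_locks(prev_probs, curr_probs, unmasked, locked, threshold=1e-3):
    """Return the lock set after one step.

    Assumption (not from the paper): a position is "sure" when the KL between
    its current and previous posterior is below `threshold`. Only unmasked
    positions are eligible, and a locked position is never unlocked.
    """
    new_locked = set(locked)
    for pos in unmasked:
        if pos in new_locked:
            continue  # locks are permanent
        if kl_divergence(curr_probs[pos], prev_probs[pos]) < threshold:
            new_locked.add(pos)
    return new_locked


def relative_attention_cost(n_total, n_unlocked):
    """Ratio of O(M N d) to O(N^2 d), which simplifies to M / N."""
    return n_unlocked / n_total


# Toy posteriors over a two-token vocabulary for three positions.
prev = {0: [0.9, 0.1], 1: [0.5, 0.5], 2: [0.5, 0.5]}
curr = {0: [0.9, 0.1], 1: [0.9, 0.1], 2: [0.5, 0.5]}
unmasked = {0, 1}  # position 2 is still masked, so it is not eligible
locked = update_locks(prev, curr, unmasked, locked=set())
```

In this toy example only position 0 is locked: position 1 is unmasked but its posterior is still changing, and position 2 is stable but still masked. In an actual model, locked positions would skip the query projection and feed-forward sublayers while their cached keys and values stay available to the unlocked positions.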
