Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction
Yuerong Song, Xiaoran Liu, Ruixiao Li, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu
TL;DR
Sparse-dLLM addresses the heavy compute and memory burden of diffusion LLMs. It first shows that attention in dLLMs exhibits cross-layer sparsity and that token saliency stays stable across decoding steps, then combines dynamic bidirectional KV-cache eviction with delayed cache updates to accelerate inference. The method is training-free: it scores cached tokens with max-pooled attention weights, keeps the top-scoring fraction set by a retention ratio, and evicts the remaining KV entries, while keeping peak memory close to that of the baseline. Experiments on LLaDA and Dream show up to 10× higher throughput than the vanilla models with comparable accuracy, and better efficiency and scalability than prior cache-based approaches. The work offers practical long-context acceleration for dLLMs, and the code is open source for replication.
Abstract
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive quadratic computational complexity and memory overhead during inference. Current caching techniques accelerate decoding by storing full-layer states, yet impose substantial memory usage that limits long-context applications. Our analysis of attention patterns in dLLMs reveals persistent cross-layer sparsity, with pivotal tokens remaining salient across decoding steps and low-relevance tokens staying unimportant, motivating selective cache eviction. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching. By leveraging the stability of token saliency over steps, it retains critical tokens and dynamically evicts unimportant prefix/suffix entries using an attention-guided strategy. Extensive experiments on the LLaDA and Dream series demonstrate that Sparse-dLLM achieves up to 10× higher throughput than vanilla dLLMs, with comparable performance and similar peak memory costs, outperforming previous methods in efficiency and effectiveness. The code is available at https://github.com/OpenMOSS/Sparse-dLLM.
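
To make the attention-guided eviction idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that saliency is the mean attention each cached token receives from the tokens currently being decoded, that scores are smoothed with a stride-1 max pool, and that a fixed fraction of cached entries is retained. The names `max_pool_1d`, `select_kept_indices`, `retention_ratio`, and `kernel` are hypothetical; the delayed cache update and per-layer handling are not modeled here.

```python
from typing import List


def max_pool_1d(scores: List[float], kernel: int) -> List[float]:
    """Centered sliding-window max with stride 1; the window is clipped at the edges."""
    half = kernel // 2
    n = len(scores)
    return [max(scores[max(0, i - half): min(n, i + half + 1)]) for i in range(n)]


def select_kept_indices(
    attn: List[List[float]], retention_ratio: float, kernel: int = 3
) -> List[int]:
    """Pick which cached tokens to keep.

    attn[q][k] is the attention weight from query token q (the block being
    decoded) to cached token k (prefix or suffix). This layout is an assumption.
    """
    n_cached = len(attn[0])
    # Assumed saliency: mean attention received by each cached token.
    saliency = [sum(row[k] for row in attn) / len(attn) for k in range(n_cached)]
    # Max pooling lets neighbors of a salient token inherit its score.
    pooled = max_pool_1d(saliency, kernel)
    n_keep = max(1, int(retention_ratio * n_cached))
    # Highest pooled score first; ties go to the lower index.
    ranked = sorted(range(n_cached), key=lambda k: (-pooled[k], k))
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    # One query token attending to six cached tokens; token 1 is the salient one.
    attn = [[0.1, 0.6, 0.05, 0.05, 0.0, 0.2]]
    kept = select_kept_indices(attn, retention_ratio=0.5, kernel=3)
    print(kept)  # [0, 1, 2]: token 1 and its pooled neighbors are retained
```

In this toy case the KV entries at indices 3, 4, and 5 would be evicted. Because the abstract reports that token saliency is stable across decoding steps, a selection like this could plausibly be reused for several steps rather than recomputed at each one.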
