Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction

Yuerong Song, Xiaoran Liu, Ruixiao Li, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu

TL;DR

Sparse-dLLM reduces the compute and memory burden of diffusion LLMs. It builds on two observations, attention sparsity that persists across layers and token saliency that stays stable across decoding steps, and combines dynamic bidirectional KV-cache eviction with delayed cache updates to accelerate inference. The method is training-free and uses retention-based, max-pooled attention scoring to prune noncritical KV entries while keeping peak memory similar to the baseline. Experiments on LLaDA and Dream show up to 10× higher throughput with comparable accuracy and memory, outperforming prior cache-based approaches in efficiency and scalability. The work offers practical long-context acceleration for dLLMs and includes open-source code for replication.

Abstract

Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive quadratic computational complexity and memory overhead during inference. Current caching techniques accelerate decoding by storing full-layer states, yet impose substantial memory usage that limits long-context applications. Our analysis of attention patterns in dLLMs reveals persistent cross-layer sparsity, with pivotal tokens remaining salient across decoding steps and low-relevance tokens staying unimportant, motivating selective cache eviction. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching. By leveraging the stability of token saliency over steps, it retains critical tokens and dynamically evicts unimportant prefix/suffix entries using an attention-guided strategy. Extensive experiments on the LLaDA and Dream series demonstrate that Sparse-dLLM achieves up to 10$\times$ higher throughput than vanilla dLLMs, with comparable performance and similar peak memory costs, outperforming previous methods in efficiency and effectiveness. The code is available at https://github.com/OpenMOSS/Sparse-dLLM.
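
To make the attention-guided eviction idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that importance is the mean attention a cached token receives from the queries of the current block, that scores are smoothed with a 1D max pool, and that a fixed share of cached tokens is kept. The function names, the `retention_ratio` and `kernel_size` parameters, and the mean aggregation are all illustrative assumptions. The sketch treats prefix and suffix entries as one list and does not model the delayed cache update.

```python
import math


def max_pool_1d(scores, kernel_size):
    """Sliding-window max with 'same' padding; kernel_size is assumed odd."""
    half = kernel_size // 2
    n = len(scores)
    return [max(scores[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


def select_kept_indices(attn, retention_ratio=0.5, kernel_size=3):
    """Pick which cached tokens to retain.

    attn[q][j] is the attention weight from query q of the current block
    to cached token j (a token outside the block, prefix or suffix).
    """
    num_cached = len(attn[0])
    # Assumption: aggregate over queries with a mean.
    importance = [
        sum(row[j] for row in attn) / len(attn) for j in range(num_cached)
    ]
    # Max pooling lets neighbours of a salient token inherit its score.
    pooled = max_pool_1d(importance, kernel_size)
    num_keep = max(1, math.ceil(retention_ratio * num_cached))
    # Stable sort: ties are broken in favour of earlier positions.
    ranked = sorted(range(num_cached), key=lambda j: pooled[j], reverse=True)
    return sorted(ranked[:num_keep])


def evict(cache, kept_indices):
    """Drop every cache entry whose index is not in kept_indices."""
    return [cache[j] for j in kept_indices]


# Toy example: 2 block queries attending to 6 cached tokens.
attn = [
    [0.05, 0.60, 0.05, 0.05, 0.05, 0.20],
    [0.10, 0.50, 0.05, 0.05, 0.10, 0.20],
]
kept = select_kept_indices(attn, retention_ratio=0.5, kernel_size=3)
pruned_cache = evict(["kv0", "kv1", "kv2", "kv3", "kv4", "kv5"], kept)
```

In this toy input, token 1 receives most of the attention, so with a kernel of 3 its two neighbours are retained alongside it. With `kernel_size=1` the selection falls back to plain top-k on the averaged scores.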

Paper Structure

This paper contains 33 sections, 8 equations, 7 figures, 8 tables, 5 algorithms.

Figures (7)

  • Figure 1: Throughput vs. Accuracy across methods. Sparse-dLLM (ours) achieves the best throughput while maintaining or even improving performance of vanilla dLLMs.
  • Figure 2: Sparsity patterns in dLLM attention. Using LLaDA-8B-Instruct with L = 66, T = 32, and a block length of 32, we observe pronounced sparsity that persists across layers, with pivotal tokens remaining salient throughout decoding steps.
  • Figure 3: Overview of Sparse-dLLM.
  • Figure 4: L2-norm of KV state changes outside the current decoding block between adjacent step pairs.
  • Figure 5: Efficiency comparison across context lengths. Missing data points indicate that the model and method combination ran out of memory (OOM) on the NVIDIA 4090 (48 GB) GPU. We adopt LLaDA-8B-Instruct (LLaDA) and Dream-v0-7B-Instruct (Dream).
  • ...and 2 more figures