dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching

Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyuan Wei, Shaobo Wang, Linfeng Zhang

TL;DR

The paper tackles the high inference latency of diffusion-based LLMs (dLLMs) by introducing dLLM-Cache, a training-free adaptive caching framework that handles the static prompt and the dynamic response separately. It combines long-interval prompt caching with adaptive short-interval response caching, guided by a V-verify mechanism that uses Value-vector similarity to recompute only the most changed tokens. Empirical results on LLaDA 8B and Dream 7B show speedups of up to 9.1x without loss of output quality on many tasks, bringing dLLM inference latency closer to that of autoregressive models. The approach requires no retraining and is evaluated on both models; the authors provide code in the supplementary material and plan to release it publicly.

Abstract

Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked segments. This approach has shown significant advantages and potential. However, dLLMs suffer from high inference latency. Traditional ARM acceleration techniques, such as Key-Value caching, are incompatible with dLLMs due to their bidirectional attention mechanism. To address this specific challenge, our work begins with a key observation: dLLM inference involves a static prompt and a partially dynamic response, where most tokens remain stable across adjacent denoising steps. Based on this, we propose dLLM-Cache, a training-free adaptive caching framework that combines long-interval prompt caching with partial response updates guided by feature similarity. This design enables efficient reuse of intermediate computations without compromising model performance. Extensive experiments on representative dLLMs, including LLaDA 8B and Dream 7B, show that dLLM-Cache achieves up to 9.1× speedup over standard inference without compromising output quality. Notably, our method brings dLLM inference latency close to that of ARMs under many settings. Code is provided in the supplementary material and will be released publicly on GitHub.
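
The two caching intervals described in the abstract can be illustrated with a small scheduling sketch. It is not the authors' implementation: the function names are hypothetical, and the rule that a cache is fully refreshed whenever the 0-indexed step is a multiple of its interval is an assumption made for illustration. The interval names `k_p` (prompt) and `k_r` (response) follow the symbols $K_p$ and $K_r$ used in the figure captions.

```python
def cache_actions(step, k_p, k_r):
    """Decide what is fully recomputed at a 0-indexed denoising step.

    Assumption (not confirmed by the source): a cache is refreshed
    whenever the step index is a multiple of its interval.
    Returns (refresh_prompt, refresh_response_fully).
    """
    refresh_prompt = step % k_p == 0
    refresh_response = step % k_r == 0
    return refresh_prompt, refresh_response


def count_refreshes(num_steps, k_p, k_r):
    """Count full prompt and full response refreshes over a whole run."""
    prompt = sum(cache_actions(s, k_p, k_r)[0] for s in range(num_steps))
    response = sum(cache_actions(s, k_p, k_r)[1] for s in range(num_steps))
    return prompt, response


if __name__ == "__main__":
    # With 100 steps, a long prompt interval and a short response interval
    # leave most steps with only a partial (adaptive) response update.
    print(count_refreshes(100, k_p=50, k_r=5))
    # With both intervals set to 1, every step recomputes everything.
    print(count_refreshes(100, k_p=1, k_r=1))
```

Under this assumed rule, steps that refresh neither cache are the ones where only a subset of response tokens would be recomputed, which is where the savings would come from.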

Paper Structure

This paper contains 18 sections, 8 equations, 6 figures, 4 tables, 6 algorithms.

Figures (6)

  • Figure 1: Cosine similarity of Key, Value, Attention Output and FFN Output between two adjacent denoising steps in a dLLM, highlighting computational redundancies. The heatmaps show similarity across adjacent steps for prompt and response tokens, respectively; a lighter color indicates higher similarity between a token's feature and its value at the previous step. These results demonstrate: (I) The prompt region exhibits high similarity, while similarity in the response region varies from token to token. (II) Only a small fraction of response tokens exhibit significantly lower similarity, suggesting that selective recomputation is sufficient. (III) Response tokens' Value similarity closely aligns with their Attention Output and FFN Output similarity, supporting the use of Value changes as an indicator for identifying the most changed response tokens.
  • Figure 2: Correlation of response tokens' $\mathbf{K}$ or $\mathbf{V}$ changes with other feature changes. We calculate the cosine similarity between the response tokens' $\mathbf{K}$ or $\mathbf{V}$ vectors and their cached counterparts at adjacent steps, select the 25% most dissimilar tokens, and compute the correlation between that similarity and the adjacent-step similarity of $\mathbf{AttnOut}$ in (a) and (c), or of $\mathbf{FFNOut}$ in (b) and (d).
  • Figure 3: The dLLM-Cache pipeline. Prompt features are updated with long intervals, while response features are updated adaptively based on the similarity between newly computed and cached $\mathbf{V}$ vectors. Response features of tokens with low similarity are updated, and the rest are reused.
  • Figure 4: Effect of cache refresh intervals using LLaDA 8B Instruct. (a) Varying $K_p$ with $K_r = 1$, $\rho = 0$. (b) Varying $K_r$ under two settings: baseline with $K_p = 1$, $\rho = 0$ in gray and our setup $K_p = 50$, $\rho = 0.25$ in Table \ref{tab:main_table_combined}.
  • Figure 5: Effect of token selection strategy on GSM8K using LLaDA 8B Instruct model under varying update ratios $\rho$.
  • ...and 1 more figures
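
The selection step described in the captions of Figures 2, 3 and 5 can be sketched as follows: compare each response token's newly computed $\mathbf{V}$ vector with its cached counterpart, then mark the fraction $\rho$ of tokens with the lowest cosine similarity for recomputation. This is a minimal pure-Python sketch, not the paper's code. The function names, the use of plain lists instead of tensors, and the floor rounding of $\rho \cdot n$ are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_tokens_to_update(new_values, cached_values, rho):
    """Return sorted indices of the response tokens to recompute.

    new_values / cached_values: one V vector per response token.
    rho: update ratio in [0, 1]. Rounding down is an assumption.
    """
    sims = [cosine_similarity(n, c) for n, c in zip(new_values, cached_values)]
    num_update = math.floor(rho * len(sims))
    # Tokens whose V vector changed the most have the lowest similarity.
    order = sorted(range(len(sims)), key=lambda i: sims[i])
    return sorted(order[:num_update])


if __name__ == "__main__":
    cached = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
    new = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.9, 0.1]]
    # Token 1 rotated by 90 degrees, token 3 drifted slightly.
    print(select_tokens_to_update(new, cached, rho=0.25))
    print(select_tokens_to_update(new, cached, rho=0.5))
```

With $\rho = 0$ no token is selected, which corresponds to the settings in Figure 4 where the adaptive update is disabled; in this sketch the unselected tokens would simply reuse their cached features.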

Theorems & Definitions (1)

  • proof