EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models

Minsoo Cheong, Donghyun Son, Woosang Lim, Sungjoo Yoo

Abstract

Diffusion-based large language models (dLLMs) rely on bidirectional attention, which prevents lossless KV caching and requires a full forward pass at every denoising step. Existing approximate KV caching methods reduce this cost by selectively updating cached states, but their decision overhead scales with context length or model depth. We propose EntropyCache, a training-free KV caching method that uses the maximum entropy of newly decoded token distributions as a constant-cost signal for deciding when to recompute. Our design is grounded in two empirical observations: (1) decoded token entropy correlates with KV cache drift, providing a cheap proxy for cache staleness, and (2) feature volatility of decoded tokens persists for multiple steps after unmasking, motivating recomputation of the $k$ most recently decoded tokens. The skip-or-recompute decision requires only $O(V)$ computation per step, independent of context length and model scale. Experiments on LLaDA-8B-Instruct and Dream-7B-Instruct show that EntropyCache achieves $15.2\times$-$26.4\times$ speedup on standard benchmarks and $22.4\times$-$24.1\times$ on chain-of-thought benchmarks, with competitive accuracy and decision overhead accounting for only $0.5\%$ of inference time. Code is available at https://github.com/mscheong01/EntropyCache.
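
The skip-or-recompute decision described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`softmax`, `entropy`, `should_recompute`), the use of natural-log entropy, the "recompute when the maximum entropy exceeds `tau`" direction, and the handling of a step with no newly decoded tokens are all assumptions made for this example.

```python
import math


def softmax(logits):
    """Convert one row of logits (length V) into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_recompute(decoded_logits, tau):
    """Hypothetical decision rule: flag a recompute when the maximum entropy
    over the newly decoded tokens exceeds the threshold tau.

    decoded_logits: list of logit rows, one per token decoded at this step.
    """
    if not decoded_logits:
        # Assumption: nothing was decoded, so there is no new signal.
        return False
    max_entropy = max(entropy(softmax(row)) for row in decoded_logits)
    return max_entropy > tau
```

The cost of this check depends only on the vocabulary size and the number of tokens decoded in the step, not on the context length or the number of layers, which matches the constant-cost property the abstract claims for the decision signal.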

Paper Structure (50 sections, 3 equations, 11 figures, 9 tables, 2 algorithms)

Figures (11)

  • Figure 1: (a) Entropy and cosine distance metrics per decoding step, measured on a single GSM8K sample using the LLaDA-8B-Instruct model. (b) Max decoded token entropy vs. avg. value vector cosine distance, plotted on log–log axes.
  • Figure 2: (a)–(b) PCA projections (PC 1 vs. PC 2) of the last-layer value vectors for two mask tokens over 256 denoising steps in LLaDA-8B-Instruct on a single GSM8K sample. Color encodes step progression (dark$\to$light); the red star marks the decoding step.
  • Figure 3: Overview of EntropyCache at a single denoising step $t$. Phase 1: a full or partial forward pass is executed depending on the recompute flag; in the partial case, only mask tokens and recently decoded tokens are recomputed while the rest are read from cache. Phase 2: new tokens are decoded from the model logits and the maximum entropy $E^{t+1}$ of the decoded distributions is computed. Phase 3: the entropy is compared against threshold $\tau$ to set the recompute flag for the next step, and the $k$ most recently decoded tokens $\mathcal{R}^{t+1}$ are selected for partial recomputation.
  • Figure 4: Accuracy–throughput tradeoff on GSM8K (LLaDA-8B-Instruct) for three candidate skipping metrics across varying thresholds $\tau$.
  • Figure 5: Grid search over entropy threshold $\tau$ and recent-token budget $k_{\text{recent}}$ on GSM8K. (a, d) Accuracy. (b, e) Throughput. (c, f) Accuracy vs. throughput. Top: LLaDA; bottom: Dream.
  • ...and 6 more figures
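
The partial-recompute selection described in the Figure 3 caption (mask tokens plus the $k$ most recently decoded tokens) can be sketched as below. This is an illustrative sketch only: the class name `RecentTokenTracker`, its methods, and the use of a bounded queue to track decode order are assumptions for this example and are not taken from the paper's code.

```python
from collections import deque


class RecentTokenTracker:
    """Hypothetical helper that remembers the k most recently decoded positions."""

    def __init__(self, k):
        # A deque with maxlen=k silently drops the oldest entry when full.
        self.recent = deque(maxlen=k)

    def record_decoded(self, positions):
        """Register positions decoded at the current step, in decode order."""
        for pos in positions:
            self.recent.append(pos)

    def positions_to_recompute(self, mask_positions):
        """Positions refreshed in a partial pass: all remaining mask tokens
        plus the k most recently decoded tokens. Everything else would be
        read from the cache."""
        return sorted(set(mask_positions) | set(self.recent))
```

For example, with `k = 2`, decoding positions 5 and 3 and then position 7 leaves positions 3 and 7 in the recent set, so a partial pass over mask positions 1 and 2 would recompute `[1, 2, 3, 7]`.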