Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution

Alessio Devoto, Maximilian Jeblick, Simon Jégou

TL;DR

This work tackles the memory bottleneck of KV caches in long-context LLMs by introducing Expected Attention, a training-free method that estimates future attention from the distribution of future queries. By deriving a closed-form, Gaussian-based model for expected attention scores, it computes an expected KV contribution to the residual stream and uses this to prune KV entries with minimal performance loss. The approach, including head-adaptive compression, yields strong memory savings across prefilling and decoding on multiple models and benchmarks, without architectural changes or retraining. The authors accompany their method with KVPress, a PyTorch-based library and public leaderboard to enable standardized benchmarking of KV cache compression methods in long-context inference.

Abstract

Memory consumption of the Key-Value (KV) cache represents a major bottleneck for efficient large language model inference. While attention-score-based KV cache pruning shows promise, it faces critical practical limitations: attention scores from future tokens are unavailable during compression, and modern implementations like Flash Attention do not materialize the full attention matrix, making past scores inaccessible. To overcome these challenges, we introduce Expected Attention, a training-free compression method that estimates the importance of KV pairs by predicting how future queries will attend to them. Our approach leverages the distributional properties of LLM activations to compute expected attention scores in closed form for each KV pair. These scores enable principled ranking and pruning of KV pairs with minimal impact on the residual stream, achieving effective compression without performance degradation. Importantly, our method operates seamlessly across both prefilling and decoding phases, consistently outperforming state-of-the-art baselines in both scenarios. Finally, we release KVPress, a comprehensive library to enable researchers to implement and benchmark KV cache compression methods, already including more than 20 techniques.
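
To make the closed-form idea concrete, here is a minimal sketch, not the authors' implementation or the KVPress API. It assumes future queries for one head follow a Gaussian $q \sim \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ (the diagonal covariance is a simplifying assumption made here) and applies the standard Gaussian moment-generating function, $\mathbb{E}[\exp(q^\top k / \sqrt{d})] = \exp\big(\mu^\top k / \sqrt{d} + \sum_j \sigma_j^2 k_j^2 / (2d)\big)$, to rank keys by unnormalized expected attention. The function names `expected_attention_scores` and `prune_kv` are hypothetical. The sketch ignores the softmax normalization, positional encodings, and the value-side contribution to the residual stream that the full method accounts for.

```python
import math


def expected_attention_scores(keys, q_mean, q_var):
    """Log of the unnormalized expected attention weight of each key.

    Assumes q ~ N(q_mean, diag(q_var)) (diagonal covariance is an
    assumption of this sketch). By the Gaussian moment-generating function,
    log E[exp(q.k / sqrt(d))] = mean.k / sqrt(d) + sum(var * k^2) / (2 d).
    Working in log space avoids overflow and does not change the ranking.
    """
    d = len(q_mean)
    scale = 1.0 / math.sqrt(d)
    log_scores = []
    for k in keys:
        mean_term = scale * sum(m * x for m, x in zip(q_mean, k))
        var_term = 0.5 * scale * scale * sum(v * x * x for v, x in zip(q_var, k))
        log_scores.append(mean_term + var_term)
    return log_scores


def prune_kv(keys, values, q_mean, q_var, compression_ratio):
    """Drop the lowest-scoring fraction of KV pairs, keeping original order.

    compression_ratio is the fraction of pairs to remove (0.5 removes half).
    Returns the kept keys, the kept values, and the kept indices.
    """
    log_scores = expected_attention_scores(keys, q_mean, q_var)
    n_keep = max(1, round(len(keys) * (1.0 - compression_ratio)))
    ranked = sorted(range(len(keys)), key=lambda i: log_scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore positional order
    return [keys[i] for i in kept], [values[i] for i in kept], kept
```

In this toy setup, `q_mean` and `q_var` would be estimated from query activations already seen for the head, and `prune_kv(keys, values, q_mean, q_var, 0.5)` would keep the half of the cache that future queries are expected to attend to most. No attention matrix is materialized, which is the property that makes this kind of score usable alongside Flash Attention.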

Paper Structure

This paper contains 35 sections, 9 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Hidden states from layers 16 and 20 and the corresponding queries for layer 20 in Llama3.1-8B. Hidden states in modern LLMs are mostly normally distributed. As a consequence, query activations also follow a normal distribution. The best Gaussian fit is overlaid. More examples and a discussion of this property appear in the appendix.
  • Figure 2: Scores on LongBench for Qwen3-8B (top) and Gemma3-12B (bottom). The x-axis shows the compression ratio and the y-axis the score on each dataset. The horizontal line marks the baseline performance without cache compression. Expected Attention achieves the best trade-off between compression ratio and score across most datasets (additional and averaged results are in the appendix).
  • Figure 3: Needle in the Haystack test for different methods with Llama3.1-8B and 50% compression ratio.
  • Figure 4: Decoding results on the AIME25 dataset. Different markers represent different model sizes. The x-axis is the maximum size that the KV cache is allowed to grow to.
  • Figure 5: Reconstruction error $\| h - h_{\text{compr}} \|$
  • ...and 6 more figures