Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics

Samhruth Ananthanarayanan; Ayan Sengupta; Tanmoy Chakraborty

TL;DR

Sparse token-route structures appear to govern how much KV cache compression an LLM tolerates, which reframes KV compression as a structural probe of attention geometry and links long-context scalability to sparsity and the lottery ticket hypothesis in self-attention.

Abstract

As context windows in LLMs scale to 100K+ tokens, the key-value (KV) cache becomes the dominant memory bottleneck, and recent compression methods claim 80-90% memory savings with minimal benchmark degradation. We argue that these evaluations miss a structural issue: attention is not just storage but routing, and retaining KV pairs does not guarantee semantic accessibility. We propose a physics-inspired view of KV compression as a controlled perturbation of token-level routing, distinguishing retention, accessibility, and utilization. Using synthetic tasks that probe multi-entity tracking, disambiguation, coreference, and multi-hop reasoning, we report three findings: (i) moderate compression degrades internal representations with little accuracy loss, revealing redundancy; (ii) all models exhibit a sharp hallucination safety cliff near 90% compression, correlated with spikes in Global Eviction Ratio (GER), suggesting a phase transition in semantic reachability; and (iii) architectures differ in routing dynamics, with LLaMA showing early consensus and late diversification and Qwen showing funnel-like late convergence, leading to distinct resilience profiles. Beyond erasure, we identify representational rigidity, where excessive head-level consensus collapses routing flexibility despite token survival. These results suggest that sparse token-route structures govern compression tolerance, reframing KV compression as a structural probe of attention geometry and linking long-context scalability to sparsity and the lottery ticket hypothesis in self-attention.
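
To make the distinction between per-head retention and global erasure concrete, the following is a minimal sketch, not the paper's method. It assumes a generic score-based eviction policy in which each attention head keeps its highest-scoring tokens, and then measures the fraction of tokens that no head retains. The function names, the score values, and the "globally evicted fraction" quantity are all hypothetical illustrations; in particular, this quantity is a stand-in and should not be read as the paper's definition of GER.

```python
def keep_top_tokens(scores, compression):
    """Return the set of token indices that one head retains.

    scores: per-token importance for a single head (hypothetical values).
    compression: fraction of the cache to evict, in [0, 1).
    """
    # Always keep at least one token so the head is never left empty.
    n_keep = max(1, round(len(scores) * (1.0 - compression)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:n_keep])


def globally_evicted_fraction(per_head_scores, compression):
    """Fraction of tokens that no head retains after pruning.

    Illustrative stand-in only; NOT the paper's GER definition.
    """
    n_tokens = len(per_head_scores[0])
    kept_anywhere = set()
    for scores in per_head_scores:
        kept_anywhere |= keep_top_tokens(scores, compression)
    return (n_tokens - len(kept_anywhere)) / n_tokens


# Two hypothetical heads that attend to different tokens.
head0 = [0.9, 0.1, 0.5, 0.2, 0.3, 0.05, 0.7, 0.15, 0.25, 0.35]
head1 = [0.1, 0.9, 0.2, 0.6, 0.05, 0.7, 0.15, 0.3, 0.35, 0.25]

for ratio in (0.0, 0.5, 0.9):
    print(ratio, globally_evicted_fraction([head0, head1], ratio))
```

In this toy setup, every token still survives in at least one head at 50% compression because the two heads prefer different tokens, whereas at 90% compression most tokens are unreachable from every head. This is only an intuition for why erasure can stay low under moderate compression and rise abruptly at aggressive ratios; it makes no claim about where the transition occurs in real models.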

Paper Structure (49 sections, 15 equations, 44 figures, 6 tables, 2 algorithms)

Introduction
Background
Self-Attention and the KV Cache in Autoregressive Decoding
Efficient Attention Mechanisms
IO-Aware Exact Attention and KV Sharing Mechanisms
KV Cache Compression
Methodology
Datasets and Controlled KV Compression Evaluation
Design principles.
Generation framework.
Dataset suite.
Dataset statistics.
Significance of the generated synthetic datasets.
Tagging Framework
Experimental Setup
...and 34 more sections

Figures (44)

Figure 1: Comparison of traditional versus synthetic evaluation frameworks for KV cache compression. Existing benchmarks (left) primarily report coarse-grained task accuracy, whereas our synthetic, controlled framework (right) probes structural reachability, routing collapse, and semantic fragility under compression.
Figure 2: Base task performance across compression levels. Localized spikes under moderate compression suggest sparse substructure effects. Corresponding results for LLaMA-3.2 3B, Qwen-2.5 3B, and Qwen-2.5 14B are shown in the results appendix (figure label fig:BaseTaskPerfApp, appendix label appx:result).
Figure 3: Knowledge manipulation results. Qwen exhibits a more gradual rate of degradation under compression, particularly in the question-aware setting.
Figure 4: Coreference performance across setups. Question-aware pruning significantly increases overconfident errors.
Figure 5: Multi presence forward vs. reverse asymmetry.
...and 39 more figures
