Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics

Samhruth Ananthanarayanan, Ayan Sengupta, Tanmoy Chakraborty

TL;DR

Sparse token-route structures appear to govern how much KV cache compression an LLM tolerates. This reframes KV compression as a structural probe of attention geometry and links long-context scalability to sparsity and the lottery ticket hypothesis in self-attention.

Abstract

As context windows in LLMs scale to 100K+ tokens, the key-value (KV) cache becomes the dominant memory bottleneck, with recent methods claiming 80-90% savings and minimal benchmark degradation. We argue these evaluations miss a structural issue: attention is not just storage but routing, and retaining KV pairs does not guarantee semantic accessibility. We propose a physics-inspired view of KV compression as a controlled perturbation of token-level routing, distinguishing retention, accessibility, and utilization. Using synthetic tasks probing multi-entity tracking, disambiguation, coreference, and multi-hop reasoning, we find that (i) moderate compression degrades internal representations with little accuracy loss, revealing redundancy; (ii) all models exhibit a sharp hallucination safety cliff near 90% compression, correlated with spikes in Global Eviction Ratio (GER), suggesting a phase transition in semantic reachability; and (iii) architectures differ in routing dynamics, with LLaMA showing early consensus and late diversification, and Qwen showing funnel-like late convergence, leading to distinct resilience profiles. Beyond erasure, we identify representational rigidity, where excessive head-level consensus collapses routing flexibility despite token survival. These results suggest sparse token-route structures govern compression tolerance, reframing KV compression as a structural probe of attention geometry and linking long-context scalability to sparsity and the lottery ticket hypothesis in self-attention.

Paper Structure

This paper contains 49 sections, 15 equations, 44 figures, 6 tables, 2 algorithms.

Figures (44)

  • Figure 1: Comparison of traditional versus synthetic evaluation frameworks for KV cache compression. Existing benchmarks (left) primarily report coarse-grained task accuracy, whereas our synthetic, controlled framework (right) probes structural reachability, routing collapse, and semantic fragility under compression.
  • Figure 2: Base task performance across compression levels. Localized spikes under moderate compression suggest sparse substructure effects. Corresponding results for LLaMA-3.2 3B, Qwen-2.5 3B, and Qwen-2.5 14B are shown in the results appendix.
  • Figure 3: Knowledge manipulation results. Qwen exhibits a more gradual rate of degradation under compression, particularly in the question-aware setting.
  • Figure 4: Coreference performance across setups. Question-aware pruning significantly increases overconfident errors.
  • Figure 5: Multi presence forward vs. reverse asymmetry.
  • ...and 39 more figures