The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference

Kaleem Ullah Qasim, Jiashu Zhang, Muhammad Kafeel Shaheen, Razan Alharith, Heying Zhang

Abstract

The key-value (KV) cache is widely treated as essential state in transformer inference, and a large body of work engineers policies to compress, evict, or approximate its entries. We prove that this state is entirely redundant: keys and values at every layer are deterministic projections of the residual stream, and recomputing them from a single residual vector per token incurs exactly zero reconstruction error: the recomputed tensors are bit-identical to the cached ones, not merely close. We verify this across six models from four architecture families (135M to 4B parameters). Cross-task residual patching at every layer produces $D_{\mathrm{KL}} = 0$ between patched and original output distributions, confirming that the residual stream satisfies a Markov property and is the sole information-carrying state. Removing the cache entirely and recomputing from scratch yields token-identical output under greedy decoding on all models tested. We build on this result with KV-Direct, a bounded-memory inference scheme that checkpoints residual vectors (5 KB per token on Gemma 3-4B) instead of full KV pairs (136 KB), recomputing keys and values on demand. Over 20 conversation turns, KV-Direct holds peak memory at 42 MB while the standard cache grows past 103 MB. Against five eviction baselines (H2O, StreamingLLM, SnapKV, TOVA, window-only), KV-Direct maintains 100% token match at every cache budget, whereas every baseline degrades to 5-28% token match. A per-operation latency analysis shows recomputation runs up to 5x faster than reading cached tensors at moderate batch sizes. Code is available at https://github.com/Kaleemullahqasim/KV-Direct.

Paper Structure (43 sections, 3 theorems, 22 equations, 9 figures, 4 tables, 1 algorithm)

Key Result

Proposition 3.1

For a pre-norm transformer with $L$ layers, the output distribution $p(x_{t+1} \mid x_{\leq t})$ is a deterministic function of $\{\mathbf{h}^{(\ell)}_p\}_{p=1}^t$ for any $\ell \in \{0, 1, \ldots, L\}$. It follows that the KV cache carries zero additional information:

Figures (9)

  • Figure 1: Three inference regimes compared. (Left) Standard KV cache: stores all K/V pairs, memory grows as $O(T)$ with sequence length. (Centre) Sliding window eviction: bounds memory to the last $B$ tokens but permanently discards evicted KV entries, yielding 5--28% token match and high KL divergence. (Right) KV-Direct: evicted KV entries are replaced by residual stream checkpoints (5 KB/token for Gemma 3-4B), from which exact K and V are recomputed on the fly, achieving bounded memory with 100% token match and $D_{\mathrm{KL}} \approx 0$.
  • Figure 2: Proportional square visualisation of per-token memory anatomy. The outer grey square represents the full KV cache footprint; the inner blue square represents the residual stream checkpoint, sized proportionally by area. The visual disparity between the two directly encodes the memory inflation ratio (shown in red above each model).
  • Figure 3: Multi-turn inference evaluation. (a) Memory growth over 20 conversation turns: standard KV cache grows to 103 MB while KV-Direct stabilises at 42 MB. (b) Latency per turn: both methods track nearly identically, confirming zero inference penalty from residual checkpointing. (c) Per-token memory across all six models: the KV cache costs $7$--$27\times$ more than a single residual checkpoint.
  • Figure 4: Performance matrix across seven methods, five cache budgets, and two models. Top row: Token match percentage (higher is better; darker blue $=$ higher match). Bottom row: KL divergence from the full-cache output distribution (lower is better; blue $=$ near-zero divergence, red $=$ high divergence). KV-Direct and full KV cache achieve 100% token match and $\approx$0 KL divergence at every budget, while all five eviction baselines degrade severely (5--28% match, KL 7--14). The blue-bordered row highlights KV-Direct.
  • Figure 5: Effective rank of $\mathbf{M}^{(h)} = \mathbf{W}_q^{(h)}{\mathbf{W}_k^{(h)}}^\top$ at 90% spectral energy across three models. Each dot is one KV head at one layer. Colour: rank as a fraction of $d_{\text{head}}$ (blue $=$ compressible, red $=$ near full rank). Size: same fraction (larger $=$ higher rank). Dashed outlines on Gemma mark global-attention layers. Layer 0 consistently shows near-rank-1 heads across all models, consistent with the BOS-focus phenomenon (Xiao et al., 2024). Rank heterogeneity is visible both within and across architectures.
  • ...and 4 more figures
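
The per-token memory figures quoted in the captions (5 KB for a residual checkpoint versus 136 KB for full KV pairs on Gemma 3-4B) can be checked with a short calculation. The sketch below is illustrative only: the model dimensions (hidden size 2560, 34 layers, 4 KV heads, head dimension 256) and the 2-byte bf16 storage are assumptions not stated in the text above, chosen because they reproduce the quoted numbers. The function names are hypothetical.

```python
# Assumed Gemma 3-4B dimensions; none of these values appear in the text above.
HIDDEN_SIZE = 2560      # width of the residual stream (assumption)
NUM_LAYERS = 34         # transformer layers (assumption)
NUM_KV_HEADS = 4        # grouped-query KV heads per layer (assumption)
HEAD_DIM = 256          # dimension of each KV head (assumption)
BYTES_PER_VALUE = 2     # bf16 / fp16 storage (assumption)


def residual_bytes_per_token(hidden_size: int, bytes_per_value: int) -> int:
    """One residual vector is checkpointed per token."""
    return hidden_size * bytes_per_value


def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_value: int) -> int:
    """A key and a value vector are stored per KV head at every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


residual_kb = residual_bytes_per_token(HIDDEN_SIZE, BYTES_PER_VALUE) / 1024
kv_kb = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, BYTES_PER_VALUE) / 1024
ratio = kv_kb / residual_kb

print(f"residual checkpoint: {residual_kb:.0f} KB/token")
print(f"full KV cache:       {kv_kb:.0f} KB/token")
print(f"inflation ratio:     {ratio:.1f}x")
```

Under these assumptions the ratio sits at the upper end of the $7$--$27\times$ range reported in Figure 3(c). The other five models would need their own dimensions substituted.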

Theorems & Definitions (6)

  • Definition 3.1: Residual Markov Property
  • Proposition 3.1: Residual Sufficiency
  • Proposition 3.2: Exact KV Reconstruction
  • proof
  • Corollary 3.3: Zero Conditional Entropy
  • Remark 5.1
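
The idea behind Proposition 3.2 (Exact KV Reconstruction) can be illustrated with a toy single-layer example. This is a minimal sketch under stated assumptions, not the authors' implementation. It assumes a pre-norm layer with RMSNorm followed by linear key and value projections, uses random weights, and omits rotary position embeddings and multi-head reshaping. The names `rms_norm` and `project_kv` are hypothetical.

```python
import numpy as np


def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Pre-norm step: scale each row to unit RMS, then apply a learned gain."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight


def project_kv(residual, norm_weight, w_k, w_v):
    """Keys and values as deterministic functions of the residual stream."""
    normed = rms_norm(residual, norm_weight)
    return normed @ w_k, normed @ w_v


rng = np.random.default_rng(0)
seq_len, d_model, d_kv = 8, 64, 16  # toy sizes (assumption)

# Random stand-ins for one layer's input residuals and weights.
residual = rng.standard_normal((seq_len, d_model)).astype(np.float32)
norm_weight = rng.standard_normal(d_model).astype(np.float32)
w_k = rng.standard_normal((d_model, d_kv)).astype(np.float32)
w_v = rng.standard_normal((d_model, d_kv)).astype(np.float32)

# Standard inference: K and V are computed once and kept in the cache.
cached_k, cached_v = project_kv(residual, norm_weight, w_k, w_v)

# Checkpointing alternative: keep only the residual vectors, drop K and V,
# and recompute them on demand from the checkpoint.
checkpoint = residual.copy()
del residual
recomputed_k, recomputed_v = project_kv(checkpoint, norm_weight, w_k, w_v)

# Exact equality, not np.allclose: the recomputation is bit-identical.
k_identical = np.array_equal(cached_k, recomputed_k)
v_identical = np.array_equal(cached_v, recomputed_v)
print(k_identical, v_identical)
```

The equality in this toy holds because the same deterministic function is applied to the same inputs with the same shapes. Whether bit-identity carries over when recomputation uses different batch shapes or kernels than the original forward pass depends on the numerical backend, which this sketch does not test.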