Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs

Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying

TL;DR

TRIM-KV introduces a learnable retention-gated mechanism for KV-cache eviction in long-context LLMs. By assigning each token a decayable retention score, the model continuously evicts the least valuable tokens to stay within a fixed memory budget, while retaining high-utility tokens for longer horizons. Training combines distillation with a capacity loss to preserve output quality under memory constraints, and inference uses a simple, low-overhead eviction rule based on the learned scores. Across math reasoning, long procedural generation, and long-context benchmarks, TRIM-KV outperforms eviction baselines and, in some cases, matches or exceeds full-cache performance, while also providing interpretable insights into layer/head roles and emergent heuristics.

Abstract

Memory and computation remain core bottlenecks in long-horizon LLM inference due to the quadratic cost of self-attention and the ever-growing key-value (KV) cache. Existing strategies for memory-bounded inference, such as quantization, offloading, or heuristic KV eviction, either incur high orchestration costs or rely on unreliable attention-based proxies of importance. We propose TRIM-KV, a novel approach that learns each token's intrinsic importance at creation time via a lightweight retention gate. Each gate predicts a scalar retention score that decays over time, reflecting the long-term utility of the token for a specific layer and head. Tokens with low scores are evicted when the memory budget is exceeded, ensuring that the cache always contains the most critical tokens. TRIM-KV is trained efficiently through distillation from a frozen LLM combined with a capacity loss, requiring only gate fine-tuning and adding negligible inference overhead. Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation (LongProc), conversational long-memory benchmarks (LongMemEval), and long-context understanding (LongBench and SCBench), TRIM-KV consistently outperforms strong eviction and learnable retrieval baselines, especially in low-memory regimes. Remarkably, it even surpasses full-cache models in some settings, showing that selective retention can serve as a form of regularization, suppressing noise from uninformative tokens. Qualitative analyses further reveal that learned retention scores align with human intuition, naturally recovering heuristics such as sink tokens, sliding windows, and gist compression without explicit design. Beyond efficiency, retention scores provide insights into layer- and head-specific roles, suggesting a new path toward LLM interpretability.
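
The eviction rule described above can be illustrated with a small sketch. It assumes the decayed score takes the form $\beta_i^{t-i}$ (the notation used in the Figure 4 caption) and a budget of $M=3$ as in Figure 1. The class name `RetentionCache`, the hand-picked `betas`, and the single shared cache are hypothetical simplifications: in the paper the scores come from a learned gate and are kept per layer and head, which this toy example does not model.

```python
from dataclasses import dataclass, field


@dataclass
class RetentionCache:
    """Toy memory-bounded cache keyed by token position (hypothetical helper)."""
    budget: int                                   # M: maximum number of cached tokens
    entries: dict = field(default_factory=dict)   # position i -> base retention beta_i

    def score(self, pos: int, t: int) -> float:
        # Decayed retention at step t: beta_i ** (t - i).
        return self.entries[pos] ** (t - pos)

    def add(self, pos: int, beta: float) -> list:
        """Cache the token created at step `pos`, then evict while over budget."""
        self.entries[pos] = beta
        evicted = []
        while len(self.entries) > self.budget:
            # Evict the token whose decayed score is currently lowest.
            victim = min(self.entries, key=lambda i: self.score(i, pos))
            del self.entries[victim]
            evicted.append(victim)
        return evicted


# Assumed, hand-picked retention scores in (0, 1); a learned gate would produce these.
betas = [0.99, 0.5, 0.9, 0.6, 0.95]

cache = RetentionCache(budget=3)
evictions = []
for t, beta in enumerate(betas):
    evictions.extend(cache.add(t, beta))

kept = sorted(cache.entries)
print("kept:", kept, "evicted:", evictions)
```

In this run, the token with base score 0.5 is dropped as soon as the budget is exceeded, while the token with score 0.99 survives from step 0 to the end. A newly created token has score $\beta^0 = 1$, so in this sketch it is never the eviction victim as long as older tokens have scores strictly below 1; whether the actual method treats new tokens this way is an assumption here.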

Paper Structure

This paper contains 25 sections, 8 equations, 26 figures, 9 tables, 1 algorithm.

Figures (26)

  • Figure 1: Attention w/ eviction ($M=3$).
  • Figure 2: Training architecture.
  • Figure 3: Pareto frontiers of competing algorithms with different budgets on math benchmarks.
  • Figure 4: Visualization of token retention score $\beta_i^{t-i}$ (top) and eviction decisions $\alpha_{ti}$ (bottom).
  • Figure 5: a) Average retention scores across all layers and heads of Qwen3-4B on tokens of an AIME24 example. b) Top 10 tokens with the highest (left table) and lowest (right table) average retention. c) The layer- and head-wise sparsity level estimated by token retentions.
  • ...and 21 more figures