Fast KV Compaction via Attention Matching

Adam Zweiger, Xinghong Fu, Han Guo, Yoon Kim

TL;DR

This work describes an approach for fast context compaction in latent space through Attention Matching, which constructs compact keys and values that reproduce attention outputs and preserve attention mass at a per-KV-head level. Within this framework, we develop a family of methods that significantly push the Pareto frontier of compaction time versus quality.

Abstract

Scaling language models to long contexts is often bottlenecked by the size of the key-value (KV) cache. In deployed settings, long contexts are typically managed through compaction in token space via summarization. However, summarization can be highly lossy, substantially harming downstream performance. Recent work on Cartridges has shown that it is possible to train highly compact KV caches in latent space that closely match full-context performance, but at the cost of slow and expensive end-to-end optimization. This work describes an approach for fast context compaction in latent space through Attention Matching, which constructs compact keys and values to reproduce attention outputs and preserve attention mass at a per-KV-head level. We show that this formulation naturally decomposes into simple subproblems, some of which admit efficient closed-form solutions. Within this framework, we develop a family of methods that significantly push the Pareto frontier of compaction time versus quality, achieving up to 50x compaction in seconds on some datasets with little quality loss.
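
To make the idea of matching attention outputs and attention mass concrete, here is a minimal NumPy sketch for a single KV head. It is an illustration under stated assumptions, not the authors' algorithm: the function names `attention` and `compact_head`, the top-`t` key selection rule, the clipped least-squares fit for the per-key biases, and the least-squares fit for the values are all hypothetical choices made for this example. The abstract only states that the problem decomposes into simple subproblems, some with closed-form solutions.

```python
import numpy as np

def attention(Q, K, V, beta=None):
    """Single-head softmax attention; returns (outputs, log attention mass)."""
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    if beta is not None:
        scores = scores + beta  # optional per-key additive bias
    shift = scores.max(axis=1, keepdims=True)  # for numerical stability
    unnorm = np.exp(scores - shift)
    mass = unnorm.sum(axis=1, keepdims=True)
    return (unnorm / mass) @ V, (np.log(mass) + shift).ravel()

def compact_head(Q, K, V, t):
    """Hypothetical sketch: compress T key-value pairs to t for one head.

    Q holds reference queries. All three steps below are assumptions
    made for illustration, not the paper's exact procedure.
    """
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    unnorm = np.exp(scores - scores.max())  # unnormalized weights, global shift

    # Step 1 (assumed rule): keep the t keys receiving the most attention.
    probs = unnorm / unnorm.sum(axis=1, keepdims=True)
    idx = np.sort(np.argsort(probs.sum(axis=0))[-t:])
    Kc = K[idx]

    # Step 2: biases beta = log(w) so the compact attention mass matches the
    # full mass, i.e. unnorm[:, idx] @ w ~= unnorm.sum(axis=1).
    # Clipping is a crude stand-in for a proper nonnegative solver.
    w, *_ = np.linalg.lstsq(unnorm[:, idx], unnorm.sum(axis=1), rcond=None)
    beta = np.log(np.clip(w, 1e-6, None))

    # Step 3: values in closed form, so compact outputs match full outputs.
    target_out, _ = attention(Q, K, V)
    sc = Q @ Kc.T / np.sqrt(Q.shape[1]) + beta
    P = np.exp(sc - sc.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    Vc, *_ = np.linalg.lstsq(P, target_out, rcond=None)
    return Kc, Vc, beta, idx

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((64, 16))   # 64 reference queries
    K = rng.standard_normal((128, 16))  # 128 original keys
    V = rng.standard_normal((128, 8))
    Kc, Vc, beta, idx = compact_head(Q, K, V, t=16)
    full_out, _ = attention(Q, K, V)
    compact_out, _ = attention(Q, Kc, Vc, beta)
    print("relative output error:",
          np.linalg.norm(compact_out - full_out) / np.linalg.norm(full_out))
```

Because the value fit in step 3 is a least-squares solve, the fitted values can never reproduce the reference outputs worse than simply reusing the selected original values with the same keys and biases. Random data like the above has no structure to exploit, so the printed error mainly shows the mechanics rather than realistic compaction quality.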

Paper Structure (62 sections, 25 equations, 10 figures, 7 tables, 4 algorithms)

Figures (10)

  • Figure 1: Accuracy vs. Compaction Time Trade-off (Qwen3-4B; QuALITY). We compare downstream QA accuracy ($n=894$) after compaction, plotted against the average wall-clock time required to compact a context (seconds, log-scale) using a single H100 GPU at a fixed 50$\times$ compaction ratio. Our attention-matching (AM) methods trace a speed-quality tradeoff and form the Pareto frontier, outperforming prior token-selection baselines and exceeding the performance of Cartridges (Eyuboglu et al., 2025) while being two orders of magnitude faster; additional Cartridges training may further improve its results.
  • Figure 2: Head sensitivity curves in Qwen3-4B. We fix all KV heads to a baseline compaction ratio of $0.05\times$ and vary the compaction ratio of a single head. We report the change in loss relative to the baseline (lower is better) as a function of the varied head's compaction ratio. Curves are averaged over $10$ QuALITY articles; shaded regions denote $\pm$1 standard error of the mean across articles. Some heads (e.g., L0H0) are largely insensitive to additional capacity, whereas others (e.g., L15H2) benefit substantially from storing more KV pairs.
  • Figure 3: Accuracy vs. compaction ratio across methods. We compare AM-OMP and AM-HighestAttentionKeys against Cartridges, summarization, and four prior methods. Evaluations are conducted on QuALITY and LongHealth using Qwen3-4B, Llama3.1-8B, and Gemma3-12B. Attention Matching (AM) consistently outperforms other approaches across compaction ratios, while matching Cartridges’ performance at ultra-high compaction.
  • Figure 4: Leave-one-out experiments. We ablate our main AM-OMP method and measure average log-perplexity of generations decoded from the original cache, since this is lower in variance and well-correlated with downstream performance (see the perplexity appendix). In "no biases," we first compute OMP as usual and then zero out $\bm{\beta}$, keeping the keys selected by OMP.
  • Figure 5: Reference query sampling comparison. We compare eight variants for sampling queries. Self-study-based methods perform best, especially at the greatest compaction ratios, with repeat and context-prefill close behind. Subsampling reference queries preserves performance.
  • ...and 5 more figures