Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores

Vivek Chari, Benjamin Van Durme

TL;DR

Long-context LLM deployment is memory‑limited by KV caches. The paper introduces Compactor, a training‑free, query‑agnostic eviction mechanism that combines approximate leverage scores (outlier importance) with non‑causal attention to retain the most important KV tokens, and a context‑calibration framework to adapt retention to the given context and quality budget. Through extensive experiments on Llama 3.1 and Qwen 2.5 across RULER and Longbench, Compactor achieves near full KV performance at substantial compression, and context‑calibrated compression further stabilizes performance across tasks and budgets. The work also provides compactor‑vllm, an inference engine with optimized Triton kernels to make sparse, non‑contiguous KV access practical for real‑world deployment.

Abstract

Modern Large Language Models (LLMs) are increasingly trained to support very large context windows. We present Compactor, a training-free, query-agnostic KV compression strategy that uses approximate leverage scores to determine token importance. We show that Compactor can achieve the same performance as competing methods while retaining 20% fewer tokens in both synthetic and real-world context tasks, while being more task-robust. We further introduce a procedure for context-calibrated compression: inferring the maximum compression a given context supports before significant performance loss. Using context-calibrated compression, we show that Compactor achieves full KV performance on Longbench while reducing the KV memory burden by 68%, on average. To demonstrate the efficacy and generalizability of our approach, we apply Compactor to 27 synthetic and real-world tasks from RULER and Longbench, with models from both the Qwen 2.5 and Llama 3.1 families. Finally, we release compactor-vllm, an inference engine and suite of optimized Triton kernels designed to efficiently support the sparse, non-contiguous memory access patterns inherent to compressed KV caches. This work demonstrates that Compactor offers a practical, high-performance solution for alleviating the memory bottleneck in modern LLM deployment.
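To give a sense of scale for the memory figures quoted above, the following back-of-the-envelope sketch estimates KV cache size for Llama 3.1 8B. The model dimensions (32 layers, 8 key/value heads, head dimension 128) and 16-bit storage are assumptions taken from the publicly released model configuration, not from this page. Applying the 68% average reduction to a single 128K-token context is purely illustrative, since the abstract reports that number as a Longbench average.

```python
# Illustrative KV cache arithmetic; all model dimensions below are assumptions
# based on the public Llama 3.1 8B configuration, not values stated in the paper summary.
NUM_LAYERS = 32        # transformer layers
NUM_KV_HEADS = 8       # grouped-query attention: key/value heads per layer
HEAD_DIM = 128         # dimension of each head
BYTES_PER_VALUE = 2    # fp16 / bf16 storage

def kv_bytes_per_token() -> int:
    """Bytes needed to cache keys and values for one token across all layers."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE  # 2 = K and V

def kv_cache_gib(num_tokens: int, retention: float = 1.0) -> float:
    """KV cache size in GiB when only a `retention` fraction of tokens is kept."""
    return num_tokens * retention * kv_bytes_per_token() / 2**30

bytes_per_token = kv_bytes_per_token()
context_len = 128 * 1024                           # a 128K-token context
full_gib = kv_cache_gib(context_len)               # uncompressed cache
compressed_gib = kv_cache_gib(context_len, 0.32)   # 68% reduction -> 32% retained

print(f"{bytes_per_token // 1024} KiB per token")
print(f"full cache: {full_gib:.2f} GiB, at 32% retention: {compressed_gib:.2f} GiB")
```

Under these assumptions the cache costs 128 KiB per token, so one 128K-token sequence occupies 16 GiB uncompressed and about 5.12 GiB at 32% retention.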

Paper Structure

This paper contains 23 sections, 3 theorems, 28 equations, 6 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

Let $\epsilon, \delta \in (0, 1)$ and a data matrix $\mathbf{K} \in \mathbb{R}^{N \times d}$ be given, and take $k \geq C d \log(\frac{d}{\delta}) \epsilon^{-2}$ for some universal constant $C$. Construct $\hat{\mathbf{K}}_k \in \mathbb{R}^{k \times d}$ by sampling $k$ times (with replacement) from the dis… where $A \preccurlyeq B$ means that $B - A$ is positive semidefinite (PSD). In practice (broadbent_subset_2010; paschou_pca-co…)
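The theorem statement is cut off in this extract, so the sketch below does not reproduce the paper's exact construction. It uses the standard leverage-score sampling recipe as an assumption: draw rows with probability proportional to their leverage scores, rescale each drawn row by $1/\sqrt{k p_i}$, and measure how well $\hat{\mathbf{K}}_k^\top \hat{\mathbf{K}}_k$ matches $\mathbf{K}^\top \mathbf{K}$ in the PSD order. The function names, the synthetic data, and the choice of $k$ are all hypothetical.

```python
import numpy as np

def leverage_scores(K: np.ndarray) -> np.ndarray:
    """Exact leverage scores: squared row norms of an orthonormal basis for col(K)."""
    Q, _ = np.linalg.qr(K)                 # reduced QR: Q is N x d
    return np.sum(Q * Q, axis=1)

def leverage_sample(K: np.ndarray, k: int, rng: np.random.Generator):
    """Sample k rows with replacement, p_i proportional to leverage, rescaled by 1/sqrt(k p_i).

    The rescaling is the textbook choice and an assumption here.
    """
    scores = leverage_scores(K)
    p = scores / scores.sum()
    idx = rng.choice(K.shape[0], size=k, replace=True, p=p)
    return K[idx] / np.sqrt(k * p[idx])[:, None], idx

def spectral_error(K: np.ndarray, K_hat: np.ndarray) -> float:
    """Smallest eps with (1-eps) K^T K <= K_hat^T K_hat <= (1+eps) K^T K in PSD order."""
    L = np.linalg.cholesky(K.T @ K)        # assumes K has full column rank
    L_inv = np.linalg.inv(L)
    M = L_inv @ (K_hat.T @ K_hat) @ L_inv.T
    M = (M + M.T) / 2                      # symmetrize against round-off
    return float(np.max(np.abs(np.linalg.eigvalsh(M) - 1.0)))

rng = np.random.default_rng(0)
K = rng.standard_normal((5000, 8))         # synthetic stand-in for a key matrix
K[:5] *= 50.0                              # a few outlier rows with high leverage

scores = leverage_scores(K)
K_hat, idx = leverage_sample(K, k=2000, rng=rng)
eps = spectral_error(K, K_hat)
print(f"sum of leverage scores = {scores.sum():.3f}, achieved eps = {eps:.3f}")
```

The leverage scores of a full-column-rank matrix sum to $d$, which is a quick sanity check on the implementation. The achieved `eps` shrinks roughly like $\sqrt{d \log d / k}$ as `k` grows, in line with the sample-size condition in the theorem.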

Figures (6)

  • Figure 1: Example of non-causal attention matrices from heads 5, 6, and 7 in layer 16 of Llama 3.1-8B Instruct. Lighter colors indicate higher attention scores. Note the columnar and diagonal structure in the non-causal upper-triangular regions of the matrices.
  • Figure 2: Median wall-clock overhead of the selection mechanism for one layer of Llama 3.1 (batch size 1).
  • Figure 4: Mean RULER score (all tasks) on Llama 3.1 and Qwen 2.5 across KV retention rates.
  • Figure 5: NLL vs KV retention on 13 tasks from RULER, one line per sub-task. For each sub-task we compute the (reciprocal) gain in NLL of the ground-truth answer when conditioned on compressed contexts (for each compression method). This empirical trend motivates the exponential calibration function introduced in §\ref{method:cal_compress}. These figures are generated without head-adaptive compression.
  • Figure 6: The top bar shows the KV retention rates that each compression method induces when using context-calibrated compression on Longbench. The bottom bar shows the same when the LLM is finetuned on documents from the Longbench test set (no queries). In all cases, the performance of the compression methods is within 0.1 of full KV cache performance ($42.4\pm0.1$). Results are shown for Llama-3.1 8B.
  • ...and 1 more figure

Theorems & Definitions (6)

  • Definition 1: Leverage Score
  • Theorem 1: Spectral Preservation of Leverage Sampling
  • Definition 2: Subspace Embedding
  • Theorem 2: Approximate Leverage Scores
  • proof
  • Corollary 1
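The listing above pairs a subspace-embedding definition with a theorem on approximate leverage scores. One standard way to connect the two, shown here as a hedged sketch rather than the paper's algorithm, is to sketch the matrix with a random embedding, take the $R$ factor of the sketched matrix, and use the squared row norms of $\mathbf{K} R^{-1}$ as approximate scores. The Gaussian sketch, the sketch size, the synthetic data, and the top-k retention step are all assumptions made for illustration.

```python
import numpy as np

def exact_leverage(K: np.ndarray) -> np.ndarray:
    """Exact leverage scores via a reduced QR factorization."""
    Q, _ = np.linalg.qr(K)
    return np.sum(Q * Q, axis=1)

def approx_leverage(K: np.ndarray, sketch_rows: int, rng: np.random.Generator) -> np.ndarray:
    """Approximate leverage scores from a Gaussian sketch (assumed embedding type).

    If S is a subspace embedding for col(K), then K @ inv(R), with R taken from
    the QR factorization of S @ K, has nearly orthonormal columns, so its squared
    row norms approximate the true leverage scores.
    """
    N, _ = K.shape
    S = rng.standard_normal((sketch_rows, N)) / np.sqrt(sketch_rows)
    _, R = np.linalg.qr(S @ K)             # R is d x d
    B = np.linalg.solve(R.T, K.T).T        # equals K @ inv(R) without forming inv(R)
    return np.sum(B * B, axis=1)

rng = np.random.default_rng(0)
K = rng.standard_normal((4096, 8))         # synthetic stand-in for per-head keys
K[:5] *= 50.0                              # planted outlier tokens

exact = exact_leverage(K)
approx = approx_leverage(K, sketch_rows=1024, rng=rng)
ratio = approx / exact                     # multiplicative approximation quality

# Hypothetical retention step: keep the tokens with the largest approximate scores.
keep = np.argsort(-approx)[:5]
print(f"ratio range: [{ratio.min():.3f}, {ratio.max():.3f}], kept tokens: {sorted(keep.tolist())}")
```

A dense Gaussian sketch is used only because it is easy to write down; it costs as much as the exact computation here, so a practical variant would rely on a faster embedding. The planted outlier rows receive the largest scores, which matches the "outlier importance" reading of leverage scores in the TL;DR.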