Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores
Vivek Chari, Benjamin Van Durme
TL;DR
Long-context LLM deployment is memory-limited by KV caches. The paper introduces Compactor, a training-free, query-agnostic eviction mechanism that combines approximate leverage scores (outlier importance) with non-causal attention to retain the most important KV tokens, together with a context-calibration framework that adapts the retention rate to the given context and quality budget. In experiments on Llama 3.1 and Qwen 2.5 across RULER and Longbench, Compactor achieves close to full-KV performance at substantial compression, and context-calibrated compression further stabilizes performance across tasks and budgets. The work also provides compactor-vllm, an inference engine with optimized Triton kernels that makes sparse, non-contiguous KV access practical for real-world deployment.
Abstract
Modern Large Language Models (LLMs) are increasingly trained to support very large context windows. We present Compactor, a training-free, query-agnostic KV compression strategy that uses approximate leverage scores to determine token importance. We show that Compactor can match the performance of competing methods while retaining 20% fewer tokens on both synthetic and real-world context tasks, and that it is more task-robust. We further introduce a procedure for context-calibrated compression: inferring the maximum compression a given context supports before significant performance loss. Using context-calibrated compression, we show that Compactor achieves full KV performance on Longbench while reducing the KV memory burden by 68% on average. To demonstrate the efficacy and generalizability of our approach, we apply Compactor to 27 synthetic and real-world tasks from RULER and Longbench, with models from both the Qwen 2.5 and Llama 3.1 families. Finally, we release compactor-vllm, an inference engine and suite of optimized Triton kernels designed to efficiently support the sparse, non-contiguous memory access patterns inherent to compressed KV caches. This work demonstrates that Compactor offers a practical, high-performance solution for alleviating the memory bottleneck in modern LLM deployment.
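To make the idea of leverage-score-based token importance concrete, the sketch below computes exact statistical leverage scores for the rows of a toy key matrix and keeps the highest-scoring rows. It is an illustration under stated assumptions, not the paper's method: Compactor uses approximate leverage scores and combines them with non-causal attention, neither of which is reproduced here. The names leverage_scores, keep_top_tokens, and retain_fraction are hypothetical and chosen for this example only.

```python
import math


def leverage_scores(rows):
    """Exact leverage scores for the rows of an n x d matrix (list of lists).

    The columns are orthonormalised with modified Gram-Schmidt; the score of
    row i is the squared norm of row i of the resulting orthonormal basis.
    Scores lie in [0, 1] and sum to the rank of the matrix.
    """
    n, d = len(rows), len(rows[0])
    # Column view of the matrix: cols[j][i] == rows[i][j].
    cols = [[rows[i][j] for i in range(n)] for j in range(d)]
    basis = []
    for col in cols:
        v = col[:]
        for q in basis:
            proj = sum(a * b for a, b in zip(q, v))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-10:  # skip numerically dependent columns
            basis.append([a / norm for a in v])
    return [sum(q[i] ** 2 for q in basis) for i in range(n)]


def keep_top_tokens(rows, retain_fraction):
    """Return the sorted indices of the rows with the highest leverage scores.

    retain_fraction is a hypothetical knob standing in for a retention budget.
    """
    scores = leverage_scores(rows)
    k = max(1, round(retain_fraction * len(rows)))
    top = sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Toy "key" matrix: four identical tokens and one outlier token.
keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
scores = leverage_scores(keys)
kept = keep_top_tokens(keys, retain_fraction=0.2)
```

In this toy matrix the four redundant rows share their direction's leverage equally, while the single outlier row carries all of the leverage for its direction, so it is the row retained at a 20% budget. This is the sense in which leverage scores act as a measure of outlier importance.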
