KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction
Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, Hyun Oh Song
TL;DR
KVzip introduces a query-agnostic KV cache eviction strategy that uses context reconstruction as a proxy for KV importance. By simulating reconstruction with chunked cross-attention scoring, KVzip identifies a compact subset of KV pairs that preserves inference quality across diverse future queries, so a context can be prefilled and compressed offline and then reused. Empirically, it reduces KV cache size by 3–4× and decoding latency by about 2× with negligible task performance loss across long-context benchmarks and multiple models, including Qwen, LLaMA, and Gemma, at context lengths up to 170K tokens. It also supports context-independent eviction, integrates with KV quantization, and outperforms prior query-aware methods in multi-query settings.
Abstract
Transformer-based large language models (LLMs) cache context as key-value (KV) pairs during inference. As context length grows, KV cache sizes expand, leading to substantial memory overhead and increased attention latency. This paper introduces KVzip, a query-agnostic KV cache eviction method enabling effective reuse of compressed KV caches across diverse queries. KVzip quantifies the importance of a KV pair using the underlying LLM to reconstruct original contexts from cached KV pairs, subsequently evicting pairs with lower importance. Extensive empirical evaluations demonstrate that KVzip reduces KV cache size by $3$-$4\times$ and FlashAttention decoding latency by approximately $2\times$, with negligible performance loss in question-answering, retrieval, reasoning, and code comprehension tasks. Evaluations include various models such as LLaMA3.1, Qwen2.5, and Gemma3, with context lengths reaching up to 170K tokens. KVzip significantly outperforms existing query-aware KV eviction methods, which suffer from performance degradation even at a 90% cache budget ratio under multi-query scenarios.
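
The score-then-evict idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes, purely for illustration, that the importance of a cached pair is the maximum attention weight it receives from a set of reconstruction queries, and that eviction keeps the top-scoring pairs under a budget ratio. The function names `importance_scores` and `evict`, the toy vectors, and the single-head setting are all hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def importance_scores(queries, keys):
    """Score each cached key by the largest attention weight it receives.

    Assumption for this sketch: `queries` stand in for the query vectors
    produced while the model reconstructs the context, and max-over-queries
    attention is used as the importance proxy.
    """
    dim = len(keys[0])
    scores = [0.0] * len(keys)
    for q in queries:
        # Scaled dot-product logits of this query against every cached key.
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        for i, weight in enumerate(softmax(logits)):
            scores[i] = max(scores[i], weight)
    return scores


def evict(scores, keep_ratio):
    """Return the sorted indices of KV pairs retained under the budget."""
    budget = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy cache of four keys; two reconstruction queries attend to keys 0 and 1.
keys = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.1], [-1.0, -1.0]]
queries = [[4.0, 0.0], [0.0, 4.0]]

scores = importance_scores(queries, keys)
kept = evict(scores, keep_ratio=0.5)
print(kept)  # [0, 1]
```

Because the scores in this sketch depend only on the context and not on any downstream question, the retained indices can be computed once and reused for every later query, which is what query-agnostic means in this setting.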
