KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction

Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, Hyun Oh Song

TL;DR

KVzip introduces a query-agnostic KV cache eviction strategy that uses context reconstruction as a proxy for KV importance. By simulating reconstruction with chunked cross-attention scoring, KVzip identifies a compact subset of KV pairs that preserves inference quality across diverse future queries, so a context can be prefilled and compressed offline and then reused. Empirically, it delivers a 3–4× reduction in KV cache size and about a 2× reduction in decoding latency, with negligible task performance loss across long-context benchmarks and multiple models, including Qwen, LLaMA, and Gemma, at context lengths of up to 170K tokens. It also supports context-independent eviction and integrates with KV quantization, and it outperforms prior query-aware methods in multi-query settings.

Abstract

Transformer-based large language models (LLMs) cache context as key-value (KV) pairs during inference. As context length grows, KV cache sizes expand, leading to substantial memory overhead and increased attention latency. This paper introduces KVzip, a query-agnostic KV cache eviction method enabling effective reuse of compressed KV caches across diverse queries. KVzip quantifies the importance of a KV pair using the underlying LLM to reconstruct original contexts from cached KV pairs, subsequently evicting pairs with lower importance. Extensive empirical evaluations demonstrate that KVzip reduces KV cache size by $3$-$4\times$ and FlashAttention decoding latency by approximately $2\times$, with negligible performance loss in question-answering, retrieval, reasoning, and code comprehension tasks. Evaluations include various models such as LLaMA3.1, Qwen2.5, and Gemma3, with context lengths reaching up to 170K tokens. KVzip significantly outperforms existing query-aware KV eviction methods, which suffer from performance degradation even at a 90% cache budget ratio under multi-query scenarios.
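The scoring-then-eviction idea in the abstract can be illustrated with a minimal sketch. It assumes a single attention head, a precomputed attention matrix from the reconstruction pass, and max-aggregation over reconstruction queries as the importance score. The function names `kv_importance` and `keep_indices` and the `ratio` parameter are hypothetical, chosen for illustration. This is not the authors' implementation, which operates per layer and per head and processes the context in chunks.

```python
import math
from typing import List


def kv_importance(attn: List[List[float]]) -> List[float]:
    """Score each cached KV pair by the maximum attention it receives.

    attn[q][k] is the attention weight that reconstruction query q
    places on cached KV pair k (one head, one layer; an assumption).
    """
    if not attn:
        return []
    n_keys = len(attn[0])
    return [max(row[k] for row in attn) for k in range(n_keys)]


def keep_indices(scores: List[float], ratio: float) -> List[int]:
    """Return indices of the highest-scoring KV pairs, in original order.

    ratio is the fraction of the cache to retain (hypothetical parameter).
    """
    n_keep = max(1, math.ceil(ratio * len(scores)))
    # Rank positions by descending score; ties keep their original order.
    ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return sorted(ranked[:n_keep])


# Toy attention from 3 reconstruction queries onto 4 cached KV pairs.
attn = [
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.05, 0.70, 0.05],
    [0.60, 0.10, 0.20, 0.10],
]
scores = kv_importance(attn)
kept = keep_indices(scores, ratio=0.5)
print(scores, kept)
```

Because the scores depend only on the context and the reconstruction prompt, not on any user query, the retained indices could be computed once and reused for every later query, which is the property the abstract calls query-agnostic.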

Paper Structure

This paper contains 47 sections, 2 equations, 20 figures, 3 tables, 1 algorithm.

Figures (20)

  • Figure 1: Overview of KV eviction strategies in multi-query scenarios. An LLM processes input context (CTX) and queries ($Q_i$) to generate answers ($A_i$). Existing approaches, such as SnapKV and PyramidKV, evict context KV pairs based on immediate query information. (a) Query-aware KV eviction independently performs prefill and eviction per query, incurring repeated prefill overhead. (b) Reusing a query-dependent compressed cache leads to performance degradation for subsequent queries (see Figure 2). (c) The proposed query-agnostic KV eviction framework compresses the KV cache only once during the initial prefill, enabling efficient reuse across diverse queries without repeated prefill or performance loss. Adapting existing methods to the query-agnostic framework still results in suboptimal performance due to a mismatch with their original designs (see the experiments section).
  • Figure 2: Accuracy on SQuAD using LLaMA3.1-8B. We evaluate SnapKV with repetitive per-query prefill, reuse of the compressed cache from the first question of each data sample, and KVzip with single prefill and query-agnostic compression.
  • Figure 3: Transformer LLM viewed as a context encoder-decoder. Each matrix cell indicates a KV pair. We use the prompt "Repeat the previous context:".
  • Figure 4: Method overview. KVzip evicts KV pairs with the lowest importance scores, accommodating both KV pair-level and head-level eviction (cf. AdaKV, DuoAttention). System prompts are omitted for clarity.
  • Figure 5: Histogram comparing max attention scores received by KV pairs in $\text{KV}_c$ during prefill versus reconstruction stages, measured on SQuAD with LLaMA3.1-8B.
  • ...and 15 more figures