KeepKV: Achieving Periodic Lossless KV Cache Compression for Efficient LLM Inference

Yuxuan Tian, Zihan Wang, Yebo Peng, Aomufei Yuan, Zhiming Wang, Bairen Yi, Xin Liu, Yong Cui, Tong Yang

TL;DR

This paper tackles the memory bottleneck that growing KV caches cause in LLM inference by proposing KeepKV, a theoretically grounded KV cache merging framework. It introduces Electoral Votes to record merge history and ZIP-Merging to make compression lossless at the current step, with EMA-based predictions to bound multi-step perturbations. The authors formally analyze output perturbation, prove single-step losslessness, bound the multi-step error, and extend the approach to long-context generation. Empirically, KeepKV delivers over 2x throughput gains and maintains high generation quality across standard and long-context benchmarks, outperforming both eviction and existing merging methods.

Abstract

Efficient inference of large language models (LLMs) is hindered by an ever-growing key-value (KV) cache, making KV cache compression a critical research direction. Traditional methods selectively evict less important KV cache entries, which leads to information loss and hallucinations. Recently, merging-based strategies have been explored to retain more information by merging KV pairs that would be discarded; however, these existing approaches inevitably introduce inconsistencies in attention distributions before and after merging, causing degraded generation quality. To overcome this challenge, we propose KeepKV, a novel adaptive KV cache merging method designed to preserve performance under strict memory constraints, achieving single-step lossless compression and providing error bounds for multi-step compression. KeepKV introduces the Electoral Votes mechanism that records merging history and adaptively adjusts attention scores. Moreover, it further leverages a novel Zero Inference-Perturbation Merging method, compensating for attention loss resulting from cache merging. Extensive experiments on various benchmarks and LLM architectures demonstrate that KeepKV substantially reduces memory usage while successfully retaining essential context information, achieving over 2x inference throughput improvement and maintaining superior generation quality even with only 10% KV cache budgets.

Paper Structure

This paper contains 29 sections, 7 theorems, 38 equations, 5 figures, 4 tables.

Key Result

Theorem 2

Current weighted merging (convex combination) methods reduce the merged KV pair's attention score compared to the sum of the original scores before merging, i.e., ${A'}_r^t < A_e^t + A_c^t$, ultimately leading to $\left\| o'_t - o_t \right\| > 0$.
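
The inequality in Theorem 2 can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it uses a single head with scaled dot-product attention, random toy keys and values, and merge weights proportional to the two original attention scores (one common convex-combination choice, assumed here). All names (`attend`, `mix`, `merged_score`, and so on) are hypothetical. The reasoning behind the result is Jensen's inequality: the exponential of a convex combination of two logits is at most the convex combination of their exponentials, which is strictly less than their sum, so the merged entry receives less attention than the two originals did together.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [x / total for x in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(q, keys, values):
    """Return (attention weights, output vector) for one query."""
    scale = math.sqrt(len(q))
    weights = softmax([dot(q, k) / scale for k in keys])
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, out


def mix(a, b, wa, wb):
    """Weighted combination wa * a + wb * b of two vectors."""
    return [wa * x + wb * y for x, y in zip(a, b)]


random.seed(0)
d, n = 8, 6  # toy head dimension and cache length (assumptions)
q = [random.gauss(0, 1) for _ in range(d)]
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
values = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

# Full-cache attention: A holds the scores, o is the output o_t.
A, o = attend(q, keys, values)

# Merge entry e (to be evicted) into entry c with convex weights.
e, c = 0, 1
w_e = A[e] / (A[e] + A[c])  # assumed weighting scheme
w_c = 1.0 - w_e
k_r = mix(keys[e], keys[c], w_e, w_c)
v_r = mix(values[e], values[c], w_e, w_c)

# Attention over the compressed cache: merged pair plus untouched entries.
A_m, o_m = attend(q, [k_r] + keys[2:], [v_r] + values[2:])

merged_score = A_m[0]          # A'_r
original_sum = A[e] + A[c]     # A_e + A_c
perturbation = math.sqrt(sum((x - y) ** 2 for x, y in zip(o_m, o)))

print(f"A'_r = {merged_score:.4f}, A_e + A_c = {original_sum:.4f}")
print(f"||o'_t - o_t|| = {perturbation:.4f}")
```

The first inequality holds for any convex weights and any inputs; the nonzero output perturbation is what this toy instance shows for generic random values.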

Figures (5)

  • Figure 1: Illustration of KeepKV vs. Existing Methods. The three middle blocks represent KV subject to eviction/merging. (a) Eviction methods permanently discard them. (b) Merging methods integrate them into retained KV, but the result is not equivalent to the full KV, causing "Attention Sag." (c) Full KV serves as the ideal baseline. (d) KeepKV uses Electoral Votes as merging records and applies ZIP-Merging to minimize output disturbance, ensuring consistency and improving performance.
  • Figure 2: (a) Cumulative distribution of attention scores. Retaining the top-$k$ tokens does not always preserve the majority of scores. (b) Proportion of to-be-evicted prompt tokens appearing in the top-20% attention scores during generation (compression rate = 20%). (c) Each token's variance of its attention scores at each generation step (blue dots) is greater than the average variance within a sliding window (orange dots). (d) Relative errors for prediction of KeepKV and existing methods.
  • Figure 3: Illustrative example of KeepKV. (0) $(k_e, v_e)$ is selected for eviction by a specific compression method. (1) The retained KV with the highest cosine similarity, $(k_c, v_c)$, is selected. (2) EMA attention scores are updated. (3) ZIP-Merging is performed. (4) Consequently, with the Electoral Votes, the compressed KV can preserve the influence of the original KV in attention computations.
  • Figure 4: Performance of KeepKV and other methods for Llama backbones on HELM and LM-Eval evaluations.
  • Figure 5: Accuracy experiments combining KeepKV with existing eviction methods.
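
Step (1) of the Figure 3 caption, choosing the retained key most similar to the evicted one, can be sketched as follows. This is a minimal illustration, assuming keys are plain non-zero float vectors; the function names `cosine_similarity` and `select_merge_target` are hypothetical and not taken from the paper. It covers only the selection step, not the EMA update or ZIP-Merging.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_merge_target(k_e, retained_keys):
    """Return (index, similarity) of the retained key closest to k_e."""
    sims = [cosine_similarity(k_e, k) for k in retained_keys]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]


# Toy 2-D keys (assumed data): the second retained key points in
# nearly the same direction as the evicted key.
retained = [[1.0, 0.0], [0.6, 0.8], [-1.0, 0.0]]
k_e = [0.5, 0.9]
idx, sim = select_merge_target(k_e, retained)
print(idx, round(sim, 4))
```

In a real cache this search would run per attention head over tensors rather than Python lists; the list-based version is kept only for readability.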

Theorems & Definitions (12)

  • Remark 1
  • Theorem 2
  • Theorem 3
  • Remark 4
  • Theorem 5
  • Lemma 6
  • Theorem 7: Formal version of Theorem \ref{thm:attn_collapse}
  • proof
  • Theorem 8
  • Lemma 9
  • ...and 2 more