KVCrush: Key value cache size-reduction using similarity in head-behaviour

Gopi Krishna Jha, Sameh Gobriel, Liubov Talamanova, Nilesh Jain

TL;DR

KVCrush tackles the KV cache memory bottleneck in large language models by introducing a binary per-head token representation derived from attention patterns and an anchor-based grouping mechanism to retain representative proxies for evicted tokens. This modular approach is designed to be complementary to existing KV compression techniques and requires no retraining or architectural changes. Empirically, it achieves up to 4× KV cache reduction with less than 1% accuracy loss and under 0.5% additional latency, outperforming several state-of-the-art eviction schemes and remaining compatible with paged KV deployments and mixed-precision caches. The work demonstrates practical benefits for memory-constrained, high-throughput LLM inference, with strong potential for dynamic budgeting and multi-anchor enhancements in future work.

Abstract

Key-value (KV) caching has emerged as a crucial optimization technique for accelerating inference in large language models (LLMs). By allowing the attention operation to scale linearly rather than quadratically with the total sequence length, KV caching significantly enhances generation throughput. However, due to the large context lengths of modern LLMs, the memory footprint of the KV cache is a major bottleneck for model deployment: it directly limits the batch size and thereby the model's ability to deliver high throughput. Existing research addresses this challenge using several techniques, such as discarding low-attention tokens, quantization, and matrix approximation, which typically have a negative impact on model accuracy. In this paper, we propose KVCrush, a technique that can be combined with many KV compression technologies to improve model accuracy at a much smaller memory footprint. KVCrush provides an alternate representation scheme for key-value states, along with a low-overhead token pruning algorithm that accounts for the token distribution in the KV cache, which in turn allows for a smaller footprint while maintaining the accuracy of the model. Based on our results, KVCrush reduces LongBench KV cache size by 4x with less than 1% accuracy drop and achieves state-of-the-art average accuracy with minimal overhead, adding less than 0.5% to total inference latency. KVCrush not only outperforms the accuracy of state-of-the-art importance-based token retention schemes but is also compatible with typical practical LLM deployments using KV cache paging schemes such as vLLM and mixed-precision quantization.
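The selection idea described above can be illustrated with a small sketch. It is not the authors' implementation; it only mirrors the ingredients named in this summary: a budget split into an important part and a representative part, a binary per-head signature for each token, and grouping relative to an anchor. Everything else is an assumption made for illustration: the function names (`binary_signatures`, `hamming`, `select_tokens`), the rule that a bit is 1 when a token's attention is at or above that head's mean, the all-ones anchor, the use of Hamming distance, and summed attention as a stand-in for the baseline importance score.

```python
from typing import List


def binary_signatures(attn: List[List[float]]) -> List[List[int]]:
    """Build one bit per head for every token.

    attn[h][t] is the attention mass token t received in head h (assumed input).
    Assumption: the bit is 1 when the value is at or above that head's mean.
    """
    num_heads, num_tokens = len(attn), len(attn[0])
    means = [sum(row) / num_tokens for row in attn]
    return [
        [1 if attn[h][t] >= means[h] else 0 for h in range(num_heads)]
        for t in range(num_tokens)
    ]


def hamming(a: List[int], b: List[int]) -> int:
    """Number of positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def select_tokens(
    attn: List[List[float]],
    budget: int,
    representative_fraction: float = 0.25,
) -> List[int]:
    """Return sorted indices of tokens to keep within `budget`."""
    num_tokens = len(attn[0])
    budget = min(budget, num_tokens)
    b_rep = int(budget * representative_fraction)  # representative share
    b_imp = budget - b_rep                         # important share

    # Stand-in importance score: attention summed over heads (hypothetical;
    # the summary says this part is delegated to a baseline method).
    importance = [sum(row[t] for row in attn) for t in range(num_tokens)]
    ranked = sorted(range(num_tokens), key=lambda t: importance[t], reverse=True)
    important = ranked[:b_imp]
    rest = ranked[b_imp:]
    if b_rep == 0 or not rest:
        return sorted(important)

    # Order the remaining tokens by distance of their signature to the anchor.
    sigs = binary_signatures(attn)
    anchor = [1] * len(attn)  # hypothetical anchor choice
    by_dist = sorted(rest, key=lambda t: (hamming(sigs[t], anchor), t))

    # Cut the ordered list into b_rep contiguous groups and keep the middle
    # token of each group as a proxy for the tokens that are evicted.
    b_rep = min(b_rep, len(by_dist))
    reps = []
    for g in range(b_rep):
        lo = g * len(by_dist) // b_rep
        hi = (g + 1) * len(by_dist) // b_rep
        reps.append(by_dist[(lo + hi) // 2])
    return sorted(important + reps)
```

In this sketch, calling `select_tokens(attn, budget=4, representative_fraction=0.5)` keeps two tokens by importance and two as group proxies. The grouping needs only bit comparisons and one sort, which is the kind of lightweight step the summary contrasts with costlier clustering.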

Paper Structure

This paper contains 22 sections, 1 equation, 6 figures, 3 tables, 2 algorithms.

Figures (6)

  • Figure 1: KVCrush flow: cache budget $B$ is split into $B_{important}$ (pivotal tokens via baseline methods) and $B_{representative}$ (representative proxies selected using lightweight grouping).
  • Figure 2: Latency breakdown on LongBench microbenchmark running on an Intel® Xeon® Platinum 8470 processor. H2O and H2O+KVCrush reduce KV cache size by $4\times$, which leads to a $3.2\times$ reduction in memory access latency. KVCrush adds only $\sim$0.2% overhead while improving accuracy.
  • Figure 3: Accuracy-latency trade-off on GSM8K. KMeans offers slightly higher accuracy but with 200% more latency, while KVCrush improves H2O with negligible cost. Note: Phi3 and LLaMA3 denote Phi-3-mini-4k-instruct and Meta-Llama-3-8B-Instruct.
  • Figure 4: Accuracy gain versus the percentage of a fixed total cache budget allocated to KVCrush. The trade-off between KVCrush and baseline allocation yields an empirical sweet spot of 20–50% for most workloads, while a few (e.g., narrativeqa) benefit from higher allocations.
  • Figure 5: Accuracy impact of integrating KVCrush with H2O, SnapKV, and PyramidKV on 2wikimqa. KVCrush improves both token and chunk-level pruning.
  • ...and 1 more figures