Beyond Token Eviction: Mixed-Dimension Budget Allocation for Efficient KV Cache Compression

Ruijie Miao, Zhiming Wang, Wang Li, Shiwei Wu, Shufan Liu, Yanbing Jiang, Tong Yang

Abstract

Key-value (KV) caching is widely used to accelerate transformer inference, but its memory cost grows linearly with input length, limiting long-context deployment. Existing token eviction methods reduce memory by discarding less important tokens, which can be viewed as a coarse form of dimensionality reduction that assigns each token either zero or full dimension. We propose MixedDimKV, a mixed-dimension KV cache compression method that allocates dimensions to tokens at a more granular level, and MixedDimKV-H, which further integrates head-level importance information. Experiments on long-context benchmarks show that MixedDimKV outperforms prior KV cache compression methods that do not rely on head-level importance profiling. When equipped with the same head-level importance information, MixedDimKV-H consistently outperforms HeadKV. Notably, our approach achieves comparable performance to full attention on LongBench with only 6.25% of the KV cache. Furthermore, in the Needle-in-a-Haystack test, our solution maintains 100% accuracy at a 50K context length while using as little as 0.26% of the cache.
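To make the "zero or full dimension" view concrete, the sketch below counts the cache budget in retained dimensions per token. It is only an illustration: the head dimension of 128, the 16-token sequence, and the particular mixed split are assumptions chosen for this example and are not taken from the paper. Token eviction is the special case where each entry is either 0 or `head_dim`.

```python
def budget_fraction(dims, head_dim):
    """Fraction of the full cache used when token i keeps dims[i] dimensions."""
    return sum(dims) / (len(dims) * head_dim)


head_dim = 128   # assumed head dimension, for illustration only
n_tokens = 16    # assumed sequence length, for illustration only

# Token eviction: each token gets either 0 or the full head_dim.
eviction = [head_dim] * 1 + [0] * (n_tokens - 1)

# Hypothetical mixed-dimension split with the same total budget:
# one token at half dimension, two tokens at quarter dimension.
mixed = [64, 32, 32] + [0] * (n_tokens - 3)

print(budget_fraction(eviction, head_dim))   # 0.0625
print(budget_fraction(mixed, head_dim))      # 0.0625
print(sum(d > 0 for d in eviction), sum(d > 0 for d in mixed))  # 1 3
```

Both allocations use 6.25% of the full cache, but the mixed one retains partial information about three tokens instead of full information about one.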

Paper Structure (32 sections, 1 theorem, 10 equations, 9 figures, 9 tables, 1 algorithm)


Key Result

Theorem 4.1

The gap between the primal allocation problem (eq:allocation_primal) and its Lagrangian dual is bounded by a small constant $\Delta$ that is independent of the sequence length $N$.
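A Lagrangian dual of a budgeted allocation problem can be solved by pricing each retained dimension with a multiplier and letting every token choose independently. The sketch below is a generic Lagrangian relaxation with bisection on the multiplier, not the authors' algorithm: the function name `allocate`, the candidate dimensions, and the loss table are all hypothetical. It assumes the candidate list contains 0, is sorted in ascending order, and that all losses are finite.

```python
def allocate(losses, dims, budget, iters=60):
    """Pick one candidate dimension per token under a total budget.

    losses[i][j] is the (hypothetical) loss of token i when compressed
    to dims[j] dimensions. Returns a list of candidate indices.
    """
    def pick(lam):
        # For a fixed multiplier, each token minimizes loss + lam * cost.
        choice = [
            min(range(len(dims)), key=lambda j: row[j] + lam * dims[j])
            for row in losses
        ]
        return choice, sum(dims[j] for j in choice)

    choice, used = pick(0.0)
    if used <= budget:
        return choice          # budget is not binding

    lo, hi = 0.0, 1.0
    while pick(hi)[1] > budget:
        hi *= 2.0              # grow the price until the budget is met
    for _ in range(iters):
        mid = (lo + hi) / 2
        if pick(mid)[1] > budget:
            lo = mid
        else:
            hi = mid           # hi always yields a feasible allocation
    return pick(hi)[0]


dims = [0, 32, 64, 128]        # assumed candidate dimensions
losses = [
    [1.0, 0.4, 0.1, 0.0],
    [1.0, 0.9, 0.8, 0.0],
    [0.2, 0.1, 0.05, 0.0],
]
choice = allocate(losses, dims, budget=192)
print([dims[j] for j in choice])
```

Because each token selects from a discrete set, the allocation returned at the final multiplier may leave part of the budget unused; that slack is the kind of primal-dual gap the theorem bounds.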

Figures (9)

  • Figure 1: The two-stage dimension allocation process for KV cache compression. Candidate compression ratios are evaluated for each token to compute loss scores. An optimization objective is then applied to minimize total loss under the specified memory budget.
  • Figure 2: Comparison of HeadKV and MixedDimKV-H.
  • Figure 3: Performance comparison on RULER. The terms 'SKV', 'PKV', 'MD', 'HKV', 'MD-H' denote SnapKV, PyramidKV, MixedDimKV, HeadKV, MixedDimKV-H, respectively.
  • Figure 4: Results of Needle-in-a-Haystack test on Llama-3-8B-Instruct-1048K with KV size $128$.
  • Figure 5: Decoding latency and peak memory usage.
  • ...and 4 more figures

Theorems & Definitions (2)

  • Theorem 4.1
  • Proof of Theorem 4.1