Beyond Token Eviction: Mixed-Dimension Budget Allocation for Efficient KV Cache Compression

Ruijie Miao; Zhiming Wang; Wang Li; Shiwei Wu; Shufan Liu; Yanbing Jiang; Tong Yang

Abstract

Key-value (KV) caching is widely used to accelerate transformer inference, but its memory cost grows linearly with input length, limiting long-context deployment. Existing token eviction methods reduce memory by discarding less important tokens, which can be viewed as a coarse form of dimensionality reduction that assigns each token either zero or full dimension. We propose MixedDimKV, a mixed-dimension KV cache compression method that allocates dimensions to tokens at a more granular level, and MixedDimKV-H, which further integrates head-level importance information. Experiments on long-context benchmarks show that MixedDimKV outperforms prior KV cache compression methods that do not rely on head-level importance profiling. When equipped with the same head-level importance information, MixedDimKV-H consistently outperforms HeadKV. Notably, our approach achieves comparable performance to full attention on LongBench with only 6.25% of the KV cache. Furthermore, in the Needle-in-a-Haystack test, our solution maintains 100% accuracy at a 50K context length while using as little as 0.26% of the cache.
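The view of token eviction as a zero-or-full dimension assignment can be made concrete with simple budget accounting. The sketch below is illustrative only: the head dimension of 128, the 32-token sequence, and the candidate dimensions {0, 16, 32, 128} are assumptions chosen for the arithmetic, not values taken from the paper.

```python
def cache_fraction(dims_per_token, full_dim):
    """Fraction of the full KV cache used when token i keeps dims_per_token[i] dimensions."""
    return sum(dims_per_token) / (len(dims_per_token) * full_dim)


FULL_DIM = 128   # assumed per-head dimension (hypothetical)
N_TOKENS = 32    # assumed sequence length (hypothetical)

# Token eviction: every token keeps either all dimensions or none.
eviction = [FULL_DIM] * 2 + [0] * (N_TOKENS - 2)

# Mixed-dimension allocation: tokens may keep intermediate dimensions.
mixed = [128] + [32] * 2 + [16] * 4 + [0] * (N_TOKENS - 7)

eviction_fraction = cache_fraction(eviction, FULL_DIM)
mixed_fraction = cache_fraction(mixed, FULL_DIM)
eviction_kept = sum(1 for d in eviction if d > 0)
mixed_kept = sum(1 for d in mixed if d > 0)
```

Under these assumed numbers both allocations use the same 6.25% budget, but the mixed allocation retains some information for seven tokens where eviction retains two.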

Paper Structure (32 sections, 1 theorem, 10 equations, 9 figures, 9 tables, 1 algorithm)

Introduction
Related Works
Preliminary
Attention Mechanism and KV Cache
Principal Component Analysis
Solution
Head-wise Compression
Accuracy Loss Score
Latent dimension allocation
Inter-head vs. Intra-head Optimization
Implementation
Experiments
Setup
Main Results on Long-Context Benchmarks
Needle-in-a-Haystack Test
...and 17 more sections

Key Result

Theorem 4.1

The duality gap between the primal allocation problem (eq:allocation_primal) and its Lagrangian dual is bounded by a small constant $\Delta$ that is independent of the sequence length $N$.
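A bounded duality gap matters because the Lagrangian form of a budgeted allocation decouples across tokens: for a fixed multiplier, each token independently picks the candidate dimension that minimizes its loss plus the multiplier times its cost. The following is a minimal sketch of that generic idea, not the paper's algorithm; the function name `allocate`, the loss table layout, and the bisection on the multiplier are all assumptions made for illustration.

```python
def allocate(losses, dims, budget, iters=50):
    """Pick one candidate dimension per token via Lagrangian relaxation.

    losses[i][k] is the (hypothetical) loss of token i when kept at dims[k].
    Returns a list of chosen indices into dims whose total size fits the budget.
    """
    if min(dims) * len(losses) > budget:
        raise ValueError("budget is infeasible even at the smallest dimension")

    def pick(lam):
        # For a fixed multiplier the problem separates per token.
        choice = [
            min(range(len(dims)), key=lambda k: row[k] + lam * dims[k])
            for row in losses
        ]
        return choice, sum(dims[k] for k in choice)

    choice, cost = pick(0.0)
    if cost <= budget:
        return choice  # the unconstrained optimum already fits

    lo, hi = 0.0, 1.0
    while pick(hi)[1] > budget:  # grow the multiplier until feasible
        hi *= 2.0
    for _ in range(iters):       # bisect towards the smallest feasible multiplier
        mid = (lo + hi) / 2.0
        if pick(mid)[1] > budget:
            lo = mid
        else:
            hi = mid
    return pick(hi)[0]
```

For example, with candidate dimensions `[0, 16, 64]`, losses `[[10.0, 2.0, 0.0], [1.0, 0.5, 0.0], [5.0, 4.0, 0.0]]`, and a budget of 80, the sketch returns `[1, 0, 2]`: the first token is reduced to 16 dimensions, the second is dropped, and the third is kept in full. In general a solution found this way may leave part of the budget unused, which is where a gap between primal and dual can arise.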

Figures (9)

Figure 1: The two-stage dimension allocation process for KV cache compression. Candidate compression ratios are evaluated for each token to compute loss scores. An optimization objective is then applied to minimize total loss under the specified memory budget.
Figure 2: Comparison of HeadKV and MixedDimKV-H.
Figure 3: Performance comparison on RULER. The terms 'SKV', 'PKV', 'MD', 'HKV', 'MD-H' denote SnapKV, PyramidKV, MixedDimKV, HeadKV, MixedDimKV-H, respectively.
Figure 4: Results of Needle-in-a-Haystack test on Llama-3-8B-Instruct-1048K with KV size $128$.
Figure 5: Decoding latency and peak memory usage.
...and 4 more figures

Theorems & Definitions (2)

Theorem 4.1
Proof of Theorem 4.1
