KVSlimmer: Theoretical Insights and Practical Optimizations for Asymmetric KV Merging

Lianjun Liu; Hongli An; Weiqi Yan; Xin Du; Shengchuan Zhang; Huazhong Liu; Yunshan Zhong

TL;DR

This work introduces KVSlimmer, an efficient algorithm that captures exact Hessian information through a mathematically exact formulation, and derives a closed-form solution utilizing only forward-pass variables, resulting in a gradient-free approach that is both memory- and time-efficient.

Abstract

The growing computational and memory demands of the Key-Value (KV) cache significantly limit the ability of Large Language Models (LLMs). While KV merging has emerged as a promising solution, existing methods rely on empirical observations of KV asymmetry and on gradient-based Hessian approximations; as a result, they lack a theoretical foundation, yield suboptimal compression, and incur inference overhead. To bridge these gaps, we establish a theoretical framework that characterizes this asymmetry through the spectral energy distribution of the projection weights, demonstrating that concentrated spectra in the Query/Key weights induce feature homogeneity, whereas dispersed spectra in the Value weights preserve heterogeneity. We then introduce KVSlimmer, an efficient algorithm that captures exact Hessian information through a mathematically exact formulation and derives a closed-form solution utilizing only forward-pass variables, resulting in a gradient-free approach that is both memory- and time-efficient. Extensive experiments across various models and benchmarks demonstrate that KVSlimmer consistently outperforms SOTA methods. For instance, on Llama3.1-8B-Instruct, it improves the LongBench average score by 0.92 points while reducing memory costs and latency by 29% and 28%, respectively.
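The link the abstract draws between spectral concentration and feature homogeneity can be illustrated with a small synthetic experiment. The sketch below is not the paper's analysis: it uses random data, hand-built projection matrices, and the top singular value's energy share as a stand-in for the paper's spectral measures. The names `adjacent_cosine_similarity`, `spectral_energy_share`, `w_concentrated`, and `w_dispersed` are hypothetical and introduced only for illustration.

```python
import numpy as np


def adjacent_cosine_similarity(x):
    """Mean cosine similarity between consecutive rows (tokens) of x, shape (T, d)."""
    x = np.asarray(x, dtype=np.float64)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)  # guard against zero-norm rows
    return float(np.mean(np.sum(unit[:-1] * unit[1:], axis=1)))


def spectral_energy_share(w, k=1):
    """Fraction of squared singular-value energy held by the top-k singular values."""
    s = np.linalg.svd(np.asarray(w, dtype=np.float64), compute_uv=False)
    energy = s ** 2
    return float(energy[:k].sum() / energy.sum())


rng = np.random.default_rng(0)
d, num_tokens = 64, 128

# Assumption: synthetic hidden states share a common direction u plus noise.
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
hidden = 4.0 * u + rng.standard_normal((num_tokens, d))

# Concentrated spectrum: one dominant mode along u, all others strongly damped.
w_concentrated = 10.0 * np.outer(u, u) + 0.1 * np.eye(d)

# Dispersed spectrum: an orthogonal matrix, so every singular value equals 1.
w_dispersed, _ = np.linalg.qr(rng.standard_normal((d, d)))

share_conc = spectral_energy_share(w_concentrated)
share_disp = spectral_energy_share(w_dispersed)
sim_conc = adjacent_cosine_similarity(hidden @ w_concentrated)
sim_disp = adjacent_cosine_similarity(hidden @ w_dispersed)

print(f"top-1 energy share: concentrated={share_conc:.3f}, dispersed={share_disp:.3f}")
print(f"adjacent-token cosine: concentrated={sim_conc:.3f}, dispersed={sim_disp:.3f}")
```

In this toy setting the concentrated projection amplifies the shared direction and suppresses the rest, so neighboring tokens end up nearly parallel, while the flat-spectrum projection leaves the per-token noise intact. Real projection weights and hidden states are far less clean, so this only conveys the intuition.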

Paper Structure (24 sections, 40 equations, 14 figures, 2 tables)

Introduction
Related Work
Long-context Segmentation and Sliding
KV Cache Eviction
KV Cache Merging
Root of KV Asymmetry
Preliminary
Theoretical Analysis of QKV Homogeneity and Heterogeneity
KVSlimmer
Exact Hessian Derivation for Key-Key Coupling
Computation Simplification
Experiments
Experimental Setup
Long Context Performance Evaluation
Runtime/Memory Efficiency
...and 9 more sections

Figures (14)

Figure 1: Comparison between AsymKV and KVSlimmer for KV cache merging.
Figure 2: Layer-wise QKV similarity and spectral analysis. Left column: Mean adjacent-token cosine similarity for Query (Q), Key (K), and Value (V), averaged over attention heads. Middle column: Eigenvalue distributions of the projection matrices $\mathbf{W}_Q$, $\mathbf{W}_K$, and $\mathbf{W}_V$, sorted in descending order. Right column: Mode-wise contribution coefficients $c_i$ (Eq. \ref{eq:mode_contribution}), plotted according to the eigenvalue index. The first two rows show results from Llama-3.1-8B-Instruct, while the last two rows show results from Mistral-7B-Instruct-v0.3.
Figure 3: Head-level mean alignment relationships at Layer 2 of Llama-3.1-8B-Instruct on 2WikiMQA. Each point corresponds to one attention head, positioned by its global mean cosine alignment.
Figure 4: Relative runtime of KVSlimmer compared to AsymKV across LongBench datasets.
Figure 5: Inference efficiency of decoder stage.
...and 9 more figures
