KVSlimmer: Theoretical Insights and Practical Optimizations for Asymmetric KV Merging

Lianjun Liu, Hongli An, Weiqi Yan, Xin Du, Shengchuan Zhang, Huazhong Liu, Yunshan Zhong

TL;DR

This work introduces KVSlimmer, an efficient algorithm that captures exact Hessian information and derives a closed-form solution using only forward-pass variables, yielding a gradient-free approach that is both memory- and time-efficient.

Abstract

The growing computational and memory demands of the Key-Value (KV) cache significantly limit the ability of Large Language Models (LLMs). While KV merging has emerged as a promising solution, existing methods rely on empirical observations of KV asymmetry and on gradient-based Hessian approximations; they lack a theoretical foundation, yield suboptimal compression, and add inference overhead. To bridge these gaps, we establish a theoretical framework that characterizes this asymmetry through the spectral energy distribution of projection weights, demonstrating that concentrated spectra in Query/Key weights induce feature homogeneity, whereas dispersed spectra in Value weights preserve heterogeneity. We then introduce KVSlimmer, an efficient algorithm that captures exact Hessian information and derives a closed-form solution using only forward-pass variables, yielding a gradient-free approach that is both memory- and time-efficient. Extensive experiments across various models and benchmarks demonstrate that KVSlimmer consistently outperforms state-of-the-art (SOTA) methods. For instance, on Llama3.1-8B-Instruct, it improves the LongBench average score by 0.92 while reducing memory costs and latency by 29% and 28%, respectively.
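
To make the notion of a "concentrated" versus "dispersed" spectrum concrete, the following is a minimal sketch of one possible way to quantify it: the fraction of spectral energy (sum of squared singular values) carried by the top-k modes of a weight matrix. The function name `top_k_energy_fraction`, the choice of k, and the synthetic matrices are assumptions for illustration only; this is not the paper's metric or implementation.

```python
import numpy as np

def top_k_energy_fraction(weight: np.ndarray, k: int) -> float:
    """Fraction of spectral energy (sum of squared singular values) in the top-k modes.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Singular values are returned in descending order.
    s = np.linalg.svd(weight, compute_uv=False)
    energy = s ** 2
    return float(energy[:k].sum() / energy.sum())

rng = np.random.default_rng(0)
d = 64

# Synthetic "concentrated" matrix: one dominant rank-1 mode plus small noise.
u = rng.standard_normal((d, 1))
v = rng.standard_normal((1, d))
concentrated = 10.0 * (u @ v) / d + 0.01 * rng.standard_normal((d, d))

# Synthetic "dispersed" matrix: i.i.d. Gaussian entries, energy spread over many modes.
dispersed = rng.standard_normal((d, d)) / np.sqrt(d)

conc_frac = top_k_energy_fraction(concentrated, k=4)
disp_frac = top_k_energy_fraction(dispersed, k=4)
print(f"top-4 energy fraction, concentrated: {conc_frac:.3f}")
print(f"top-4 energy fraction, dispersed:    {disp_frac:.3f}")
```

On real checkpoints, the same function could be applied to the Query, Key, and Value projection weights of each layer to compare their spectra; which matrices count as concentrated is an empirical question that this sketch does not answer.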

Paper Structure

This paper contains 24 sections, 40 equations, 14 figures, 2 tables.

Figures (14)

  • Figure 1: Comparison between AsymKV and KVSlimmer for KV cache merging.
  • Figure 2: Layer-wise QKV similarity and spectral analysis. Left column: Mean adjacent-token cosine similarity for Query (Q), Key (K), and Value (V), averaged over attention heads. Middle column: Eigenvalue distributions of the projection matrices $\mathbf{W}_Q$, $\mathbf{W}_K$, and $\mathbf{W}_V$, sorted in descending order. Right column: Mode-wise contribution coefficients $c_i$ (Eq. \ref{eq:mode_contribution}), plotted according to the eigenvalue index. The first two rows show results from Llama-3.1-8B-Instruct, while the last two rows show results from Mistral-7B-Instruct-v0.3.
  • Figure 3: Head-level mean alignment relationships at Layer 2 of Llama-3.1-8B-Instruct on 2WikiMQA. Each point corresponds to one attention head, positioned by its global mean cosine alignment.
  • Figure 4: Relative runtime of KVSlimmer compared to AsymKV across LongBench datasets.
  • Figure 5: Inference efficiency of decoder stage.
  • ...and 9 more figures
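
The left column of Figure 2 reports a mean adjacent-token cosine similarity, averaged over attention heads. A minimal sketch of how such a statistic could be computed is given below, assuming the Q, K, or V activations of one layer are available as an array of shape `(heads, tokens, head_dim)`. The function name `mean_adjacent_cosine`, the array layout, and the `eps` guard are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def mean_adjacent_cosine(x: np.ndarray, eps: float = 1e-12) -> float:
    """Mean cosine similarity between each token and its successor.

    x has shape (heads, tokens, head_dim); the result is averaged over
    all heads and all adjacent token pairs. Hypothetical helper.
    """
    a = x[:, :-1, :]  # tokens 0 .. T-2
    b = x[:, 1:, :]   # tokens 1 .. T-1
    dot = (a * b).sum(axis=-1)
    norms = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    # eps guards against division by zero for all-zero vectors.
    return float((dot / np.maximum(norms, eps)).mean())

# Toy inputs: identical tokens versus tokens that flip sign at every position.
identical = np.ones((2, 5, 8))
signs = np.array([1.0, -1.0, 1.0, -1.0, 1.0]).reshape(1, 5, 1)
alternating = identical * signs

sim_identical = mean_adjacent_cosine(identical)
sim_alternating = mean_adjacent_cosine(alternating)
```

Applied separately to the Q, K, and V activations of each layer, this would yield three per-layer curves of the kind the caption describes.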