Homogeneous Keys, Heterogeneous Values: Exploiting Local KV Cache Asymmetry for Long-Context LLMs
Wanyun Cui, Mingwei Xu
TL;DR
The paper tackles the challenge of extending LLM context efficiently by exposing a local KV cache asymmetry: adjacent keys are locally homogeneous while adjacent values are heterogeneous, creating a mismatch for uniform compression. It introduces AsymKV, a training-free framework that merges homogeneous keys and uses lossless, cardinality-aware value merging via Locally Merged Attention (LMA) to preserve attention outputs. The method combines Newton-like key optimization with Fisher-diagonal Hessian approximation and a cardinality-normalized value representation, achieving state-of-the-art results on LongBench across multiple base models and compression ratios. Practically, AsymKV delivers significant long-context improvements with robust efficiency, enabling scalable long-context inference without additional training, albeit with engineering considerations for integration into existing accelerators and runtimes.
Abstract
Recent advances in Large Language Models (LLMs) have highlighted the critical importance of extending context length, yet the quadratic complexity of attention mechanisms poses significant challenges for efficient long-context modeling. KV cache compression has emerged as a key approach to address this challenge. Through extensive empirical analysis, we reveal a fundamental yet previously overlooked asymmetry in KV caches: while adjacent keys receive similar attention weights ({\it local homogeneity}), adjacent values demonstrate distinct {\it heterogeneous} distributions. This key-value asymmetry exposes a critical limitation in existing compression methods that treat keys and values uniformly. To address this limitation, we propose a training-free compression framework (AsymKV) that combines homogeneity-based key merging with a mathematically proven lossless value compression. Extensive experiments demonstrate that AsymKV consistently outperforms existing long-context methods across various tasks and base models. For example, on LLaMA3.1-8B, AsymKV achieves an average score of 43.95 on LongBench, surpassing SOTA methods like H$_2$O (38.89) by a large margin. Our code is available at https://github.com/the-scale-lab/Asymkv.
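The idea of cardinality-aware value merging can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`attention`, `merge_adjacent_pairs`) are hypothetical, and a plain average stands in for the key-merging step, which the paper performs with a more careful optimization. The sketch only shows why value merging can be exact once adjacent positions share a key. Entries with the same key receive the same attention weight, so storing their mean value together with a count reproduces the output of attention over the shared keys and the original values. The exactness is relative to the shared-key cache, not to the original, unmerged keys.

```python
import math
import random


def attention(q, keys, values, counts=None):
    """Softmax attention for a single query.

    counts[i] weights entry i as if it appeared counts[i] times in the cache.
    """
    if counts is None:
        counts = [1] * len(keys)
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(logits)  # subtract the max for numerical stability
    weights = [c * math.exp(l - top) for c, l in zip(counts, logits)]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def mean_vec(a, b):
    return [(x + y) / 2 for x, y in zip(a, b)]


def merge_adjacent_pairs(keys, values):
    """Merge neighbours (0,1), (2,3), ... into one entry each.

    Assumption: averaging is a stand-in for the paper's key-merging step.
    Values are stored as a mean plus a cardinality of 2.
    """
    merged_k = [mean_vec(keys[i], keys[i + 1]) for i in range(0, len(keys), 2)]
    merged_v = [mean_vec(values[i], values[i + 1]) for i in range(0, len(values), 2)]
    counts = [2] * len(merged_k)
    return merged_k, merged_v, counts


random.seed(0)
n, d = 8, 4
q = [random.gauss(0, 1) for _ in range(d)]
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
values = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

merged_k, merged_v, counts = merge_adjacent_pairs(keys, values)

# Reference: each pair shares its merged key, but all original values are kept.
shared_keys = [k for k in merged_k for _ in range(2)]
reference = attention(q, shared_keys, values)

# Compressed: half as many entries, with cardinality-weighted softmax.
compressed = attention(q, merged_k, merged_v, counts)

max_gap = max(abs(a - b) for a, b in zip(reference, compressed))
```

In this toy setting `max_gap` is zero up to floating-point rounding, while the compressed cache holds half as many entries. Any error relative to the uncompressed cache therefore comes from the key-merging step alone, which is why the two halves of the cache call for different treatment.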
