Homogeneous Keys, Heterogeneous Values: Exploiting Local KV Cache Asymmetry for Long-Context LLMs

Wanyun Cui, Mingwei Xu

TL;DR

The paper addresses efficient long-context inference in LLMs by identifying a local asymmetry in the KV cache: adjacent keys receive similar attention weights (local homogeneity), while adjacent values are heterogeneous, so compressing keys and values with one uniform rule loses information. It introduces AsymKV, a training-free framework that merges homogeneous adjacent keys and merges values losslessly through cardinality-aware Locally Merged Attention (LMA), which preserves the attention output. The method combines Newton-like key optimization, using a Fisher-diagonal approximation of the Hessian, with a cardinality-normalized value representation, and reports state-of-the-art results on LongBench across multiple base models and compression ratios. For example, on LLaMA3.1-8B it reaches an average LongBench score of 43.95 versus 38.89 for H$_2$O. Because no additional training is required, it can be applied to existing models, though integrating it into existing accelerators and runtimes takes engineering effort.

Abstract

Recent advances in Large Language Models (LLMs) have highlighted the critical importance of extending context length, yet the quadratic complexity of attention mechanisms poses significant challenges for efficient long-context modeling. KV cache compression has emerged as a key approach to address this challenge. Through extensive empirical analysis, we reveal a fundamental yet previously overlooked asymmetry in KV caches: while adjacent keys receive similar attention weights (local homogeneity), adjacent values demonstrate distinct heterogeneous distributions. This key-value asymmetry reveals a critical limitation in existing compression methods that treat keys and values uniformly. To address the limitation, we propose a training-free compression framework (AsymKV) that combines homogeneity-based key merging with a mathematically proven lossless value compression. Extensive experiments demonstrate that AsymKV consistently outperforms existing long-context methods across various tasks and base models. For example, on LLaMA3.1-8B, AsymKV achieves an average score of 43.95 on LongBench, surpassing SOTA methods like H$_2$O (38.89) by a large margin. Our code is available at https://github.com/the-scale-lab/Asymkv.
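
The idea of cardinality-aware value merging can be illustrated with a small sketch. This is not the authors' implementation: the function names `attend` and `merge_pair` are hypothetical, the merged key is a plain weighted mean standing in for the paper's Newton-like key optimization, and everything is written in pure Python for a single query and a single head. Under those assumptions, it shows that once two adjacent slots share one key, storing the mean of their values together with a count of 2 reproduces the attention output that keeping both values separately would give.

```python
import math

def attend(q, keys, values, counts=None):
    """Softmax attention for one query.

    counts[i] is the number of original tokens that slot i stands for;
    it scales the unnormalized softmax weight of that slot.
    """
    counts = counts or [1] * len(keys)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [c * math.exp(s - top) for c, s in zip(counts, scores)]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def merge_pair(keys, values, counts, i):
    """Merge slots i and i+1 into a single slot (illustrative only).

    Key: count-weighted mean (an assumption, not the paper's optimizer).
    Value: count-weighted mean, with the summed count stored alongside.
    """
    ca, cb = counts[i], counts[i + 1]
    n = ca + cb
    key = [(ca * a + cb * b) / n for a, b in zip(keys[i], keys[i + 1])]
    val = [(ca * a + cb * b) / n for a, b in zip(values[i], values[i + 1])]
    return (keys[:i] + [key] + keys[i + 2:],
            values[:i] + [val] + values[i + 2:],
            counts[:i] + [n] + counts[i + 2:])

q = [0.3, -0.1]
keys = [[1.0, 0.0], [0.9, 0.1], [-0.5, 0.4]]
values = [[1.0, 2.0], [-3.0, 0.5], [0.0, 4.0]]
counts = [1, 1, 1]

m_keys, m_vals, m_counts = merge_pair(keys, values, counts, 0)

# Reference: the two original slots share the merged key, values stay separate.
shared_keys = [m_keys[0], m_keys[0], keys[2]]
reference = attend(q, shared_keys, values)

# Compressed: one slot with the merged key, mean value, and count 2.
compressed = attend(q, m_keys, m_vals, m_counts)

max_gap = max(abs(a - b) for a, b in zip(reference, compressed))
```

Here `max_gap` is zero up to floating-point rounding, so in this sketch any approximation error comes only from replacing the two keys with one, not from merging the values. That is why merging keys only where they are locally homogeneous matters.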

Paper Structure

This paper contains 23 sections, 18 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Contrasting distributions of local homogeneity in attentions (keys) versus local heterogeneity in values. Statistics are from Llama-2-7b-chat on the ShareGPT dataset. (a-b) demonstrate strong positive correlations between adjacent attention percentile ranks (normalized to [0,1], where 1 indicates highest attention) across all layers and heads, supporting the local homogeneity hypothesis for keys. (c-d) reveal weak or negative correlations between adjacent value similarity percentile ranks, computed from $\text{sim}(\text{val}_i,\text{val}_j)$, indicating distinct heterogeneity in values. The similarity is measured by cosine. This fundamental difference between keys and values suggests the need for separate compression strategies.
  • Figure 2: Illustration of our AsymKV mechanism. Left: Conventional approaches that uniformly merge both keys and values lead to information loss. Middle: We merge adjacent homogeneous keys for minimal loss. Right: We preserve their heterogeneous values through cardinality-aware normalization.
  • Figure 3: Performance on early topic retrieval.
  • Figure 4: Effect of different compression ratios.
  • Figure 5: Peak GPU Memory (MB)
  • ...and 3 more figures