EMS: Adaptive Evict-then-Merge Strategy for Head-wise KV Cache Compression Based on Global-Local Importance

Yingxin Li, Ye Li, Yuan Meng, Xinzhu Ma, Zihan Geng, Shutao Xia, Zhi Wang

TL;DR

EMS addresses the memory bottleneck of long-context KV caches by introducing a Global-Local score that balances global and local attention signals to select important tokens, paired with a head-wise Evict-then-Merge compression that exploits sparsity and redundancy across heads. The framework unifies eviction and merging, using a zero-class center to enable parallel, efficient compression while preserving critical information. Empirical results demonstrate state-of-the-art performance under extreme compression across LongBench and Needle-in-a-Haystack tasks, with substantial gains in perplexity, retrieval accuracy, and end-to-end throughput. The approach offers practical improvements for deploying LLMs with long contexts by enabling higher throughput and robust performance under tight cache budgets.

Abstract

As large language models (LLMs) continue to advance, the demand for higher quality and faster processing of long contexts across various applications is growing. KV cache is widely adopted as it stores previously generated key and value tokens, effectively reducing redundant computations during inference. However, as memory overhead becomes a significant concern, efficient compression of KV cache has gained increasing attention. Most existing methods perform compression from two perspectives: identifying important tokens and designing compression strategies. However, these approaches often produce biased distributions of important tokens due to the influence of accumulated attention scores or positional encoding. Furthermore, they overlook the sparsity and redundancy across different heads, which leads to difficulties in preserving the most effective information at the head level. To this end, we propose EMS to overcome these limitations, while achieving better KV cache compression under extreme compression ratios. Specifically, we introduce a Global-Local score that combines accumulated attention scores from both global and local KV tokens to better identify the token importance. For the compression strategy, we design an adaptive and unified Evict-then-Merge framework that accounts for the sparsity and redundancy of KV tokens across different heads. Additionally, we implement the head-wise parallel compression through a zero-class mechanism to enhance efficiency. Extensive experiments demonstrate our SOTA performance even under extreme compression ratios. EMS consistently achieves the lowest perplexity, improves scores by over 1.28 points across four LLMs on LongBench under a 256 cache budget, and preserves 95% retrieval accuracy with a cache budget less than 2% of the context length in the Needle-in-a-Haystack task.
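The abstract describes the Global-Local score only at a high level: accumulated attention from global and local KV tokens is combined to rank token importance. The snippet below is a minimal sketch of that idea for a single head, not the paper's formula. The function names, the `local_window` and `alpha` parameters, and the sum-to-one normalisation are all assumptions made for illustration.

```python
import numpy as np

def global_local_score(attn, local_window=4, alpha=0.5):
    """Blend global and local accumulated attention for one head.

    attn: (num_queries, num_keys) attention weights, rows sum to 1.
    local_window and alpha are hypothetical knobs, not taken from the paper.
    """
    attn = np.asarray(attn, dtype=float)
    # Global view: attention each key receives from all queries.
    global_score = attn.sum(axis=0)
    # Local view: attention each key receives from the most recent queries.
    local_score = attn[-local_window:].sum(axis=0)
    # Normalise both views so neither dominates purely by scale (assumption).
    global_score = global_score / global_score.sum()
    local_score = local_score / local_score.sum()
    return alpha * global_score + (1.0 - alpha) * local_score

def select_top_k(scores, k):
    """Return the indices of the k highest-scoring tokens, in position order."""
    return np.sort(np.argsort(scores)[-k:])

# Toy causal attention matrix for 8 tokens.
rng = np.random.default_rng(0)
mask = np.tril(np.ones((8, 8), dtype=bool))
logits = np.where(mask, rng.normal(size=(8, 8)), -np.inf)
weights = np.exp(logits - logits.max(axis=1, keepdims=True))
attn = weights / weights.sum(axis=1, keepdims=True)

scores = global_local_score(attn, local_window=4, alpha=0.5)
keep = select_top_k(scores, k=3)
```

Setting `alpha` to 1 in this sketch reduces it to purely accumulated (global) attention, and setting it to 0 keeps only the recent-window view, which makes the two sources of selection bias mentioned in the abstract easy to compare on a toy input.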

Paper Structure

This paper contains 24 sections, 5 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: The KV cache compression workflow of EMS. The tokens are first partitioned according to the ranking of their Global-Local scores, which are calculated from the attention weights of global and local tokens. The least important tokens are then evicted, while the sub-important tokens are either merged into the most important tokens or evicted by merging them into the zero-class token.
  • Figure 2: Token selection patterns. The sample is taken from the gov_report dataset, showing the Top-128 selected tokens out of a total of 512 tokens. The proposed Global-Local based selection integrates the advantages of the global and local viewpoints, yielding a more balanced selection.
  • Figure 3: The framework of EMS. The compression of the KV cache is decoupled into two parts. For the important-KV selection policy, a balanced Global-Local score is designed to capture token importance. For the KV compression strategy, the Evict-then-Merge approach first removes irrelevant tokens, then applies a unified head-wise eviction and merging process.
  • Figure 4: Observations on sparsity and redundancy. The parameters $\zeta$ and $\tau$ are set to 0.95 and 0.6 here. (a) The distribution difference between key and value similarities. The top two figures depict the raw similarity, while the bottom two showcase the masked KV similarities with a threshold of 0.8. Key similarity is much more salient than value similarity. (b) The head-wise sparsity and redundancy. The blue bars represent the sparsity of each head, while the red bars denote the redundancy. Both sparsity and redundancy vary across different heads and layers.
  • Figure 5: Evict-then-Merge details. (a) Two levels of eviction. The first level evicts the same number of irrelevant tokens. The second level evicts tokens with low similarity by merging them into the zero-class token, so the amount of eviction at this level differs across heads. (b) Unified merge and evict at the decoding stage. Different heads in the same layer make different merge or evict decisions, which are unified as a single merge operation.
  • ...and 4 more figures
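
The captions of Figures 1 and 5 outline the compression step: the lowest-ranked tokens are evicted, and each sub-important token is either merged into a most-important token or merged into a zero-class token, which amounts to eviction. The sketch below illustrates that flow for one head under stated assumptions. The function name, the cosine-similarity test on keys, the `sim_threshold` parameter, and the plain averaging used for merging are hypothetical choices, not details confirmed by the text above.

```python
import numpy as np

def evict_then_merge(keys, values, scores, n_centers, n_candidates,
                     sim_threshold=0.6):
    """Single-head sketch: evict, then merge candidates into centers or class 0."""
    order = np.argsort(scores)[::-1]                 # most important first
    centers = np.sort(order[:n_centers])             # kept as merge targets
    cand = order[n_centers:n_centers + n_candidates] # sub-important tokens
    # Every token ranked below the candidates is evicted outright.

    def unit(x):
        norm = np.linalg.norm(x, axis=-1, keepdims=True)
        return x / np.maximum(norm, 1e-12)

    # Cosine similarity between candidate keys and center keys.
    sim = unit(keys[cand]) @ unit(keys[centers]).T
    best = sim.argmax(axis=1)
    best_sim = sim.max(axis=1)
    # Class 0 is the zero-class; classes 1..n_centers are the real centers.
    cls = np.where(best_sim >= sim_threshold, best + 1, 0)

    k_sum = np.zeros((n_centers + 1, keys.shape[1]))
    v_sum = np.zeros((n_centers + 1, values.shape[1]))
    count = np.zeros(n_centers + 1)
    k_sum[1:], v_sum[1:], count[1:] = keys[centers], values[centers], 1.0
    # One scatter-add covers both outcomes, whatever class each token got.
    np.add.at(k_sum, cls, keys[cand])
    np.add.at(v_sum, cls, values[cand])
    np.add.at(count, cls, 1.0)
    # Dropping row 0 discards everything merged into the zero-class.
    merged_k = k_sum[1:] / count[1:, None]
    merged_v = v_sum[1:] / count[1:, None]
    return merged_k, merged_v, cls

keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.01], [-1.0, 0.0], [0.5, 0.5]])
values = np.arange(10, dtype=float).reshape(5, 2)
scores = np.array([0.9, 0.8, 0.3, 0.2, 0.05])
merged_k, merged_v, cls = evict_then_merge(keys, values, scores,
                                           n_centers=2, n_candidates=2)
```

Because every candidate receives a class index, merging and eviction share the same array operation. This is one way the output size can stay fixed at `n_centers` regardless of how many candidates a given head discards.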