Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks

Zheng Wang, Boxiao Jin, Zhongzhi Yu, Minjia Zhang

TL;DR

This work tackles the memory bottleneck of KV caches in long-context LLMs by introducing KVMerger, an adaptive KV cache merging approach. It identifies merging sets from token-level key-state similarity and applies Gaussian kernel weighted merging to compress the cache with minimal impact on generation quality. The method demonstrates robust improvements over eviction-based baselines (H2O) and prior merging methods (CaM) on LongBench and ZeroScrolls across multiple models and budgets (50% and 35%). The results highlight improved memory efficiency and preserved long-context retrieval, suggesting practical benefits for scalable deployment of large language models. Future directions include exploring alternative clustering strategies and extending the approach to more models and hybrid memory-management schemes.

Abstract

Efficiently serving Large Language Models (LLMs) has become a pressing issue because of the high computational cost of autoregressive generation. To reduce this cost, LLMs typically employ a KV cache to speed up generation. Although the KV cache improves computational efficiency, its storage requirements are substantial, particularly in long-context scenarios, leading to significant memory consumption. Existing KV cache eviction methods often degrade LLM performance in long-context scenarios because eviction discards information. In this paper, we propose a novel KV cache merging approach, called KVMerger, to achieve adaptive KV cache compression for long-context tasks without significant performance degradation under constrained memory budgets. Our approach is inspired by the intriguing observation that key states exhibit high similarity at the token level within a single sequence. To facilitate merging, we develop an effective yet straightforward merging set identification algorithm to identify suitable KV states for merging. This algorithm leads to a second observation: KV cache sparsity, from a similarity perspective, is independent of the dataset and persists at the model level. Subsequently, we propose a Gaussian kernel weighted merging algorithm to selectively merge all states within each merging set. We conduct extensive experiments to demonstrate the effectiveness of KVMerger for long-context tasks under constrained memory budgets, applying it to models including Llama2-7B-chat and Llama2-13B-chat. Using the LongBench and ZeroScrolls benchmarks, we compare our method with other KV cache compression techniques, including H2O and CaM, and show that it achieves superior performance across tasks with both 50% and 35% KV cache budgets.
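
The merging set identification step described above can be illustrated with a minimal sketch. It is not the paper's algorithm: it simply assumes that a merging set is a run of consecutive tokens whose neighbouring key states have cosine similarity at or above a threshold. The function names `cosine_similarity` and `identify_merging_sets` are hypothetical, and the default threshold of 0.8 is borrowed from the toy example in the Figure 4 caption.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two key-state vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)


def identify_merging_sets(keys, threshold=0.8):
    """Group consecutive token indices whose adjacent key states are similar.

    Hypothetical helper: assumes merging sets are contiguous runs, which
    relies on the locality of key-state similarity. Runs of length one are
    not returned because there is nothing to merge.
    """
    sets = []
    current = [0] if keys else []
    for i in range(1, len(keys)):
        if cosine_similarity(keys[i - 1], keys[i]) >= threshold:
            current.append(i)  # token i continues the current run
        else:
            if len(current) > 1:
                sets.append(current)
            current = [i]  # start a new run at token i
    if len(current) > 1:
        sets.append(current)
    return sets


# Toy key states for six tokens (two-dimensional for readability).
keys = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
    [0.2, 1.0],
    [-1.0, 0.0],
]
merging_sets = identify_merging_sets(keys)
print(merging_sets)  # [[0, 1], [2, 3, 4]]
```

In this toy input, tokens 0 and 1 form one merging set, tokens 2 to 4 form another, and token 5 stays unmerged because it is dissimilar to its neighbour. A lower threshold would produce larger sets and therefore a more aggressive compression ratio.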

Paper Structure

This paper contains 18 sections, 28 equations, 5 figures, 4 tables, and 2 algorithms.

Figures (5)

  • Figure 1: Three categories of KV cache compression techniques: KV cache quantization (left), KV cache eviction (middle), and KV cache merging (right). For KV cache eviction, we use aggregated attention scores as the eviction signal, with k set to 3; for KV cache merging, we illustrate many-to-one merging. The key state in red is the state that incorporates the information of the remaining states. Value states are processed in the same way as key states.
  • Figure 2: Visualization of the token-wise cosine similarity map of key states, obtained by running inference with the Llama2-7B-chat model on randomly sampled data from the SynthWiki dataset. Observations include: (1) Key states share strong similarity within one sequence across different layers and heads; (2) The similarity between key states exhibits locality, i.e., adjacent tokens exhibit higher similarity.
  • Figure 3: (a) Changes in cosine similarity between the current token and its adjacent tokens across distinct attention heads and layers, shown for tokens at indices 2000, 3000, and 4000. (b) The layer-wise compression ratios obtained by our proposed merging set identification algorithm for different samples and different tasks. (c) Comparison of long-context performance between H2O and average weighted merging with our proposed merging set identification algorithm. (d) Illustration of the Gaussian kernel function with different values of $\sigma$.
  • Figure 4: The KVMerger framework comprises two major modules. The first module identifies the merging sets through our proposed algorithm in Section 4.1. Note that the key and value states that are most sensitive to merging are excluded. A toy similarity map illustrates this process in the Merging Set Identification part, with the cosine similarity threshold set to 0.8. The second module merges the key and value states within each identified merging set via Gaussian kernel weighted merging, as described in Section 4.2. In the Gaussian kernel weighted merging illustration, the key state in red is the pivotal key state, into which all remaining key states are merged by weighted merging. Note that the values shown on the key states are the aggregated attention scores.
  • Figure 5: Visualization of the needle-in-a-haystack test on Llama2-7B-chat with different KV cache compression methods. The x-axis represents the context length, and the y-axis represents the document depth at which the needle is inserted.
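
The second module in the Figure 4 caption, Gaussian kernel weighted merging toward a pivotal state, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's exact formulation: the pivot is assumed to be the state with the highest aggregated attention score, each state is weighted by a Gaussian kernel of its Euclidean distance to the pivot, and the weights are normalized to sum to one. The names `gaussian_weight` and `merge_set`, and the default `sigma`, are hypothetical.

```python
import math


def gaussian_weight(state, pivot, sigma):
    """Gaussian kernel of the squared Euclidean distance between state and pivot."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(state, pivot))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def merge_set(states, attn_scores, sigma=1.0):
    """Merge all states in one merging set into a single state.

    Assumptions (not confirmed by the source): the pivot is the state with
    the highest aggregated attention score, and the merged state is the
    normalized Gaussian-weighted average of all states in the set.
    """
    pivot_idx = max(range(len(states)), key=lambda i: attn_scores[i])
    pivot = states[pivot_idx]
    weights = [gaussian_weight(s, pivot, sigma) for s in states]
    total = sum(weights)  # always >= 1 because the pivot's own weight is 1
    dim = len(pivot)
    merged = [
        sum(w * s[d] for w, s in zip(weights, states)) / total
        for d in range(dim)
    ]
    return pivot_idx, merged


# Toy merging set with two key states and their aggregated attention scores.
keys = [[0.0, 0.0], [2.0, 0.0]]
scores = [0.9, 0.1]
pivot_idx, merged_key = merge_set(keys, scores, sigma=1.0)
print(pivot_idx, merged_key)
```

With these inputs the pivot is token 0, and the merged key lies much closer to the pivot than the plain average `[1.0, 0.0]` would, because the distant state receives a small kernel weight. A larger `sigma` flattens the kernel and moves the result toward average weighted merging.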