WeightedKV: Attention Scores Weighted Key-Value Cache Merging for Large Language Models

Jian Yuan, Ziwei He, Haoli Bai, Jingwen Leng, Bo Jiang

TL;DR

The paper tackles the memory and latency challenges of growing KV caches in autoregressive LLMs by introducing WeightedKV, a training-free approach that preserves essential information while keeping memory usage bounded. It discards the keys of less important tokens and deterministically merges their values into neighboring tokens via a convex combination weighted by average attention scores, a choice grounded in an analysis of ideal merging. Across four long-context language modeling benchmarks, WeightedKV achieves lower perplexity than eviction-based and competing merging-based baselines, with the clearest gains at small cache budgets. Because merging perturbs future attention only minimally, the method is a practical, memory-efficient way to scale inference to longer sequences.

Abstract

Large Language Models (LLMs) use a key-value (KV) cache to reduce redundant computation in autoregressive generation. However, the KV cache grows linearly during generation, leading to excessive memory usage, especially for long texts. Most KV cache compression methods evict unimportant KV pairs to maintain a fixed cache size, which permanently loses those tokens during generation. Singular value decomposition, however, shows that values do not exhibit as strong a low-rank property as keys do, suggesting that information is distributed more evenly across values, in contrast to its more redundant distribution within keys. Methods that evict both keys and values therefore risk losing crucial information and compromising context integrity, ultimately degrading output quality. To address this problem, we propose WeightedKV, a novel, training-free approach that discards the keys of less important tokens while merging their values into neighboring tokens via a convex combination weighted by their average attention scores. In this way, the retained keys serve as anchors that guide the generation process, while the merged values provide a rich contextual backdrop. We assess our method on four widely used language modeling datasets, demonstrating superior performance compared to all baseline methods, particularly at lower budget ratios.
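
The merging step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `merge_least_important`, the choice of the next token as the merge target, and the rule that the merged slot carries the summed score are all assumptions made for illustration. Only the core idea comes from the abstract, namely dropping a low-score key and folding its value into a neighbor with weights proportional to average attention scores.

```python
import numpy as np

def merge_least_important(keys, values, avg_scores):
    """Drop one key and fold its value into a neighboring token.

    keys, values: float arrays of shape (n, d).
    avg_scores:   positive float array of shape (n,), the average
                  attention score each cached token has received.
    """
    n = len(avg_scores)
    if n < 2:
        raise ValueError("need at least two cached tokens to merge")

    evict = int(np.argmin(avg_scores))  # least important token
    # Assumption: merge into the next token, or the previous one
    # when the evicted token is last in the cache.
    neighbor = evict + 1 if evict + 1 < n else evict - 1

    w_e, w_n = avg_scores[evict], avg_scores[neighbor]
    total = w_e + w_n
    # Convex combination: the two weights are non-negative and sum to 1.
    merged = (w_e * values[evict] + w_n * values[neighbor]) / total

    values = values.copy()
    avg_scores = avg_scores.copy()
    values[neighbor] = merged
    # Assumption: the merged slot carries the combined score.
    avg_scores[neighbor] = total

    keep = np.arange(n) != evict  # the evicted key is discarded outright
    return keys[keep], values[keep], avg_scores[keep]

# Toy cache of 4 tokens with 2-dimensional values.
keys = np.eye(4)
values = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]])
scores = np.array([0.4, 0.1, 0.3, 0.2])
new_keys, new_values, new_scores = merge_least_important(keys, values, scores)
```

In this toy run the token with score 0.1 is removed and its value is blended into the following token with weights 0.25 and 0.75, so the cache shrinks from four entries to three while the discarded value still contributes to later attention outputs.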

Paper Structure

This paper contains 12 sections, 6 equations, 4 figures, 1 table, 1 algorithm.

Figures (4)

  • Figure 1: Normalized singular values of KV averaged over the first 10 sequences truncated to length 1k in the PG19 test set.
  • Figure 2: Cosine similarity between attention weights with and without value merging at step 100 on the books from PG19.
  • Figure 3: Compression process on a toy attention map with a maximum cache size of 4. Numbers in blocks represent average attention scores of tokens, while the red boxes indicate the values to be merged.
  • Figure 4: Comparison between WeightedKV and its eviction variant.