A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression

Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini

TL;DR

This work addresses the memory bottleneck of KV caches in long-context LLM inference by proposing a simple, training-free compression strategy based on the $L_2$ norm of cached key embeddings. The authors show a robust correlation between low $L_2$ norm keys and high attention, enabling eviction of high-$L_2$ norm keys while preserving model accuracy across language modeling and long-context tasks, and achieving strong results on needle-in-a-haystack and passkey retrieval benchmarks. The approach is compatible with FlashAttention and outperforms attention-score-based baselines like FastGen, with broad applicability to decoder-only transformers. Overall, the method offers a practical, effective means to reduce KV cache memory by up to 90% in certain tasks, enabling more scalable deployment of long-context LLMs.

Abstract

The deployment of large language models (LLMs) is often hindered by the extensive memory requirements of the Key-Value (KV) cache, especially as context lengths increase. Existing approaches to reduce the KV cache size involve either fine-tuning the model to learn a compression strategy or leveraging attention scores to reduce the sequence length. We analyse the attention distributions in decoder-only Transformer-based models and observe that attention allocation patterns stay consistent across most layers. Surprisingly, we find a clear correlation between the $L_2$ norm and the attention scores over cached KV pairs, where a low $L_2$ norm of a key embedding usually leads to a high attention score during decoding. This finding indicates that the influence of a KV pair is potentially determined by the key embedding itself before being queried. Based on this observation, we compress the KV cache based on the $L_2$ norm of key embeddings. Our experimental results show that this simple strategy can reduce the KV cache size by 50% on language modelling and needle-in-a-haystack tasks and by 90% on passkey retrieval tasks without losing accuracy. Moreover, because it does not rely on attention scores, this approach remains compatible with FlashAttention, enabling broader applicability.
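The eviction rule described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it handles a single attention head, represents the cache as plain Python lists, and the names `l2_norm`, `compress_kv_cache`, and `keep_ratio` are hypothetical. The paper's actual procedure may differ in details such as which layers are compressed and when eviction is triggered.

```python
import math


def l2_norm(vec):
    """Euclidean (L2) norm of one key embedding given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def compress_kv_cache(keys, values, keep_ratio):
    """Keep the fraction `keep_ratio` of KV pairs whose keys have the lowest L2 norm.

    Assumption: `keys` and `values` hold one entry per cached token for a
    single attention head. Returns (kept_keys, kept_values, kept_positions),
    with the surviving entries in their original token order.
    """
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, math.ceil(keep_ratio * len(keys))) if keys else 0
    # Rank token positions by key norm, lowest first. Low-norm keys are the
    # ones the abstract associates with high attention, so they are retained.
    ranked = sorted(range(len(keys)), key=lambda i: l2_norm(keys[i]))
    kept = sorted(ranked[:n_keep])  # restore original token order
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Toy cache with four tokens; the key norms are 5.0, ~0.22, 1.0 and 10.0.
keys = [[3.0, 4.0], [0.1, 0.2], [1.0, 0.0], [6.0, 8.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
kept_keys, kept_values, kept_positions = compress_kv_cache(keys, values, keep_ratio=0.5)
print(kept_positions)  # [1, 2]
```

Because the ranking depends only on the cached keys, no attention matrix has to be materialised, which is why a rule of this kind can sit alongside fused attention kernels such as FlashAttention.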

Paper Structure (23 sections, 7 equations, 26 figures)

Figures (26)

  • Figure 1: Five heads at layer 9 of Llama2-7b. Attention score (top) and $L_2$ norm (bottom) are highly correlated. We observe similar patterns across most layers and for a wide range of inputs. More examples are provided in \ref{sec:more_visualizations}.
  • Figure 2: ALr, as defined in \ref{eq:norm-attn-diff}, for each head and layer in Llama2-7b (left) and the Llama2-7b-32k long-context model (right). A lower value means a higher correlation between $L_2$ norm and attention score.
  • Figure 3: Perplexity for Llama 2-7b, Llama 3-8b and Gemma on the language modelling task on the Wikipedia dataset. Additional results on a coding dataset are available in \ref{sec:lm_more_results}.
  • Figure 4: Overall accuracy of llama-2-7b-80k on the needle-in-a-haystack task and the passkey retrieval task.
  • Figure 5: Overall scores on LongBench (Zhang et al., 2024) of Llama3.1-8b (left) and llama-2-7b-80k (right) for different compression ratios ranging from $0\%$ to $90\%$.
  • ...and 21 more figures