KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments

Junyoung Park, Dalton Jones, Matthew J Morse, Raghavv Goel, Mingu Lee, Chris Lott

TL;DR

KeyDiff addresses the memory bottleneck of KV caching in long-context LLM inference by exploiting the geometry of cached keys rather than attention weights. It introduces an attention-free eviction policy that minimizes pairwise key similarity, thereby maximizing diversity in the KV cache and preserving tokens that are globally informative across blocks. The method, including efficient anchor-based variants and a sliding-window extension, is theoretically justified and empirically validated across Llama and Qwen models, showing small accuracy drops under tight cache budgets ($N$) and notable latency reductions. Practically, KeyDiff enables effective long-context inference in resource-constrained environments: on LongBench it stays within $0.04\%$ of the non-evicting baseline with an $8K$ cache budget, it performs near baseline on Math-500, and it reduces end-to-end latency by up to $30\%$ relative to other token-eviction methods.

Abstract

We demonstrate that geometrically distinctive keys during LLM inference tend to have high attention scores. Based on this phenomenon, we propose KeyDiff, a training-free KV cache eviction method based solely on key similarity. Unlike other KV cache eviction methods, KeyDiff can process arbitrarily long prompts within strict resource constraints and efficiently generate responses. We provide a theoretical basis for KeyDiff by relating key diversity to attention scores. These results imply that KeyDiff can efficiently identify the most important tokens to retain. Notably, KeyDiff does not rely on attention scores, allowing the use of optimized attention mechanisms like FlashAttention. Under a strict memory allowance, we demonstrate the effectiveness of KeyDiff for the Llama and Qwen model families by observing a performance gap of less than 0.04% with an 8K cache budget ($\sim$23% KV cache reduction) from the non-evicting baseline on LongBench for Llama 3.1-8B and Llama 3.2-3B. We also observe near-baseline performance for DeepSeek-R1-Distill-Llama-8B on the Math500 reasoning benchmark and decrease end-to-end inference latency by up to 30% compared to other token-eviction methods.

Paper Structure

This paper contains 50 sections, 5 theorems, 21 equations, 19 figures, 15 tables.

Key Result

Lemma 3.1

Suppose that for a fixed query token $q$, there is a set of key tokens $\{k_i\}_{i=1}^n$ such that $||k_i||_2^2 < M, \; \forall \;i$. Without loss of generality suppose $||q||=1$ and assume ${k^*}$ is a key not in $\{k_i\}_{i=1}^n$ with $||{k^*}||_2^2 < M$ that has attention weight $w > 0$. Then, fo

Figures (19)

  • Figure 1: An example of block prompt processing with KV cache eviction. An input prompt of length 7 is split into three blocks, and a transformer layer in the LLM processes each block by (1) computing key-value states from the inputs, (2) computing attention, (3) computing the eviction score, and (4) evicting tokens based on that score to satisfy the memory constraint (e.g., at most 4 tokens can reside in the cache). After each block is processed, the updated KV cache, which satisfies the memory constraint, is passed to the next round of block processing.
  • Figure 2: Cosine similarity of the keys and attention weights. Measured from Llama 3.2-3B-Instruct and the first sample from the NarrativeQA dataset in LongBench. Truncated to the first 64 tokens for visualization.
  • Figure 3: An overview of KeyDiff. KeyDiff (1) computes the anchor vector by averaging the keys in the KV cache, (2) computes the cosine similarity between each key and the anchor, yielding eviction scores (color intensity indicates the score value), and (3) retains the KV pairs with the lowest similarities.
  • Figure 4: PCA embedding of keys and queries from Llama 3.2-3B.
  • Figure 5: (a, b, and c) Two-dimensional PCA visualizations of a key cache managed with Sink, TOVA, and KeyDiff. Retained tokens are blue, while evicted tokens are orange. Keys are taken from layer $5$ and head $3$ of Llama 3.2-3B-Instruct, and generated using the NarrativeQA dataset. (d) PCA visualization of the retained keys for each KV cache eviction method.
  • ...and 14 more figures
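
The steps described in the captions of Figures 1 and 3 can be sketched in a few lines of plain Python. This is only an illustrative sketch under simplifying assumptions, not the authors' implementation: it handles a single attention head, represents keys as lists of floats, omits the sliding-window extension, and averages the raw keys to form the anchor. The function names `cosine`, `keydiff_keep_indices`, and `process_prompt_in_blocks` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def keydiff_keep_indices(keys, budget):
    """Return sorted indices of the `budget` keys least similar to the mean key."""
    if len(keys) <= budget:
        return list(range(len(keys)))
    dim = len(keys[0])
    # (1) Anchor: the average of all cached keys (assumption: raw, unnormalized keys).
    anchor = [sum(k[d] for k in keys) / len(keys) for d in range(dim)]
    # (2) Eviction score: cosine similarity between each key and the anchor.
    scores = [cosine(k, anchor) for k in keys]
    # (3) Retain the keys with the lowest similarity, preserving token order.
    ranked = sorted(range(len(keys)), key=lambda i: scores[i])
    return sorted(ranked[:budget])


def process_prompt_in_blocks(keys, values, block_size, budget):
    """Append KV pairs block by block and evict after each block (cf. Figure 1)."""
    cache_k, cache_v = [], []
    for start in range(0, len(keys), block_size):
        cache_k.extend(keys[start:start + block_size])
        cache_v.extend(values[start:start + block_size])
        # Attention over the cache would run here; the score below never uses it.
        keep = keydiff_keep_indices(cache_k, budget)
        cache_k = [cache_k[i] for i in keep]
        cache_v = [cache_v[i] for i in keep]
    return cache_k, cache_v


# Toy example: four nearly parallel keys and one outlier at index 3.
toy_keys = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [1.0, -0.1]]
print(keydiff_keep_indices(toy_keys, 2))  # → [3, 4]
```

In the toy example the outlier key (index 3) has the lowest similarity to the anchor and is therefore the first to be retained. Because the score depends only on the keys, the attention step can be delegated to a fused kernel that never exposes attention weights.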

Theorems & Definitions (8)

  • Lemma 3.1
  • Theorem 3.2
  • Theorem C.1
  • proof
  • Lemma C.2
  • proof
  • Theorem C.3
  • proof