KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments

Junyoung Park; Dalton Jones; Matthew J Morse; Raghavv Goel; Mingu Lee; Chris Lott

TL;DR

KeyDiff addresses the memory bottleneck of KV caching in long-context LLM inference by exploiting the geometry of cached keys rather than attention weights. It introduces an attention-free eviction policy that minimizes pairwise key similarity, which maximizes the diversity of the keys retained in the KV cache and preserves tokens that remain informative across prompt blocks. The method, including efficient anchor-based variants and a sliding-window extension, is theoretically justified and empirically validated on Llama and Qwen models under tight cache budgets $N$. In practice, KeyDiff enables long-context inference in resource-constrained environments: with an $8K$ cache budget, the performance gap to the non-evicting baseline on LongBench is less than $0.04\%$, end-to-end latency is up to $30\%$ lower than with other token-eviction methods, and performance on the Math500 reasoning benchmark stays near the baseline.
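The idea of keeping the least mutually similar keys can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `keydiff_keep_indices` is hypothetical, and the anchor is assumed to be the mean of the unit-normalized keys. Under that assumption, the cosine similarity of a key to the anchor ranks keys the same way as its average pairwise cosine similarity to all cached keys, so the full pairwise matrix is never formed.

```python
import numpy as np


def keydiff_keep_indices(keys: np.ndarray, budget: int) -> np.ndarray:
    """Return sorted indices of the `budget` keys least similar to the anchor.

    keys: array of shape (num_tokens, head_dim) for one attention head.
    Assumption (not confirmed by the source): the anchor is the mean of the
    unit-normalized keys.
    """
    num_tokens = keys.shape[0]
    if budget >= num_tokens:
        return np.arange(num_tokens)

    # Normalize each key so that dot products are cosine similarities.
    unit = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-12)

    # Anchor direction: the mean of the normalized keys, renormalized.
    anchor = unit.mean(axis=0)
    anchor = anchor / (np.linalg.norm(anchor) + 1e-12)

    # Cosine similarity of every key to the anchor; no attention weights used.
    similarity = unit @ anchor

    # Keep the most distinctive keys (lowest similarity), in original order.
    keep = np.argsort(similarity, kind="stable")[:budget]
    return np.sort(keep)


# Toy example: four nearly identical keys and two distinctive ones.
toy_keys = np.array(
    [[1.0, 0.0], [1.0, 0.01], [1.0, -0.01], [0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
)
kept = keydiff_keep_indices(toy_keys, budget=3)
```

In this toy example the two distinctive keys (indices 3 and 5) survive eviction, while most of the near-duplicates are dropped.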

Abstract

We demonstrate that geometrically distinctive keys during LLM inference tend to have high attention scores. Based on this phenomenon, we propose KeyDiff, a training-free KV cache eviction method based solely on key similarity. Unlike other KV cache eviction methods, KeyDiff can process arbitrarily long prompts within strict resource constraints and efficiently generate responses. We provide a theoretical basis for KeyDiff by relating key diversity to attention scores. These results imply that KeyDiff can efficiently identify the most important tokens to retain. Notably, KeyDiff does not rely on attention scores, allowing the use of optimized attention mechanisms like FlashAttention. Under a strict memory allowance, we demonstrate the effectiveness of KeyDiff for the Llama and Qwen model families by observing a performance gap of less than 0.04% with an 8K cache budget ($\sim$23% KV cache reduction) from the non-evicting baseline on LongBench for Llama 3.1-8B and Llama 3.2-3B. We also observe near-baseline performance for DeepSeek-R1-Distill-Llama-8B on the Math500 reasoning benchmark and decrease end-to-end inference latency by up to 30% compared to other token-eviction methods.
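To make the claim about arbitrarily long prompts concrete, the following sketch processes a prompt block by block and evicts back to a fixed budget after each block, so the cache never grows beyond the budget plus one block. It is an illustration under stated assumptions rather than the paper's procedure: `evict_to_budget` and `prefill_in_blocks` are hypothetical names, the per-block keys and values are supplied as precomputed arrays standing in for a real model's projections, and similarity to the mean normalized key is an assumed stand-in for the key-similarity score.

```python
import numpy as np


def evict_to_budget(keys, values, budget):
    """Drop the keys most similar to the mean normalized key (assumed score)."""
    if keys.shape[0] <= budget:
        return keys, values
    unit = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-12)
    similarity = unit @ unit.mean(axis=0)
    # Keep the `budget` least similar keys, preserving token order.
    keep = np.sort(np.argsort(similarity, kind="stable")[:budget])
    return keys[keep], values[keep]


def prefill_in_blocks(key_blocks, value_blocks, budget):
    """Append one block at a time, then evict back down to `budget` entries.

    Returns the final cache and the peak number of cached tokens observed.
    """
    cache_k = key_blocks[0][:0]    # empty array with the right shape and dtype
    cache_v = value_blocks[0][:0]
    peak = 0
    for k_blk, v_blk in zip(key_blocks, value_blocks):
        # A real model would run (possibly fused) attention over the cache
        # plus this block here; eviction itself needs no attention weights.
        cache_k = np.concatenate([cache_k, k_blk], axis=0)
        cache_v = np.concatenate([cache_v, v_blk], axis=0)
        peak = max(peak, cache_k.shape[0])
        cache_k, cache_v = evict_to_budget(cache_k, cache_v, budget)
    return cache_k, cache_v, peak


# Toy run: 5 blocks of 16 tokens, head dimension 8, cache budget 32.
rng = np.random.default_rng(0)
k_blocks = [rng.normal(size=(16, 8)) for _ in range(5)]
v_blocks = [rng.normal(size=(16, 8)) for _ in range(5)]
final_k, final_v, peak_tokens = prefill_in_blocks(k_blocks, v_blocks, budget=32)
```

Here the peak cache size is bounded by the budget plus the block size (32 + 16 tokens), independent of the total prompt length.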
