Identify Critical KV Cache in LLM Inference from an Output Perturbation Perspective

Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S Kevin Zhou

TL;DR

This work addresses the high memory and compute costs of KV caches in long-context LLM inference by introducing a formal, perturbation-based criterion for identifying critical KV cache entries. It derives an upper bound $\theta$ on the $L_1$ output perturbation $\mathcal{L}$ (that is, $\mathcal{L} \le \theta$) that depends on both the attention weights and the value-state projections, and proposes a two-stage perturbation-constrained greedy algorithm that selects critical entries under a fixed budget. Integrated into the state-of-the-art eviction schemes SnapKV and AdaKV, the method consistently improves generation quality on the Needle-in-a-Haystack test and the LongBench benchmark, and reduces output perturbation at both the head and layer level in Llama and Mistral models. By moving beyond attention-weight-only selection, the approach offers a principled path to efficient long-context inference in deployed large language models.

Abstract

Large language models have revolutionized natural language processing but face significant challenges of high storage and runtime costs, due to the transformer architecture's reliance on self-attention, particularly the large Key-Value (KV) cache for long-sequence inference. Recent efforts to reduce KV cache size by pruning less critical entries based on attention weights remain empirical and lack formal grounding. This paper presents a formal study on identifying critical KV cache entries by analyzing attention output perturbation. Our analysis reveals that, beyond attention weights, the value states within KV entries and the pretrained parameter matrices are also crucial. Based on this, we propose a perturbation-constrained selection algorithm that optimizes the worst-case output perturbation to identify critical entries. Evaluations on the Needle-in-a-Haystack test and the LongBench benchmark show that our algorithm enhances state-of-the-art cache eviction methods. Further empirical analysis confirms that our algorithm achieves lower output perturbation in over 92% of attention heads in the Llama model, a significant improvement over existing methods.
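
The selection idea described above can be illustrated with a minimal sketch. It is not the authors' algorithm: the function name `select_critical`, the split ratio `alpha`, and the stage-two score (attention weight times the $L_1$ norm of the projected value row) are assumptions chosen only to show how attention weights and value-state projections might be combined under a budget.

```python
import numpy as np

def select_critical(attn, values, w_o, budget, alpha=0.5):
    """Hypothetical two-stage selection of KV cache entries for one head.

    attn:   (n,)      attention weights over cached entries
    values: (n, d_h)  value states
    w_o:    (d_h, d)  output projection slice for this head
    budget: number of entries to keep
    alpha:  assumed fraction of the budget spent in stage one
    """
    n = attn.shape[0]
    budget = min(budget, n)

    # Stage one: keep the entries with the largest attention weights.
    b1 = int(budget * alpha)
    stage1 = np.argsort(-attn, kind="stable")[:b1]

    # Stage two: score the rest by attention weight times the L1 norm
    # of the projected value row (an assumed scoring rule).
    proj_norm = np.abs(values @ w_o).sum(axis=1)
    score = attn * proj_norm
    score[stage1] = -np.inf  # exclude entries already selected
    stage2 = np.argsort(-score, kind="stable")[: budget - b1]

    return np.sort(np.concatenate([stage1, stage2]))

# Toy usage with random data (all shapes are illustrative).
rng = np.random.default_rng(0)
n, d_h, d = 16, 4, 8
logits = rng.normal(size=n)
attn = np.exp(logits - logits.max())
attn /= attn.sum()
kept = select_critical(attn, rng.normal(size=(n, d_h)),
                       rng.normal(size=(d_h, d)), budget=6)
print(kept)
```

In this sketch an entry with a modest attention weight can still be kept when its projected value row has a large norm, which is the behavior an attention-weight-only rule would miss.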

Paper Structure

This paper contains 27 sections, 4 theorems, 15 equations, 9 figures, 10 tables, and 2 algorithms.

Key Result

Theorem 3.2

By introducing a mask $\mathcal{N}\in \mathbb{R}^{n}$ applied through element-wise multiplication denoted by $\odot$, we can establish the relation between $A'$ and $A$ as follows:
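
The displayed relation did not survive on this page. One form consistent with the statement, assuming $\mathcal{N}$ is a binary mask (1 for retained entries, 0 for evicted ones) and that $A'$ is the softmax recomputed over the retained entries only, is $A' = \frac{\mathcal{N} \odot A}{\lVert \mathcal{N} \odot A \rVert_1}$. The short NumPy check below, with hypothetical scores and a hypothetical mask, illustrates that this renormalization matches recomputing the softmax. It is a sketch under those assumptions, not the paper's own statement.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
n = 8
scores = rng.normal(size=n)  # hypothetical pre-softmax attention scores
A = softmax(scores)          # attention weights over the full cache

# Hypothetical binary mask: 1 keeps an entry, 0 evicts it.
mask = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)

# Masked and renormalized weights: (N ⊙ A) / ||N ⊙ A||_1.
A_masked = mask * A / np.sum(mask * A)

# Reference: softmax recomputed over the retained entries only.
kept = mask.astype(bool)
A_direct = np.zeros(n)
A_direct[kept] = softmax(scores[kept])

max_gap = float(np.max(np.abs(A_masked - A_direct)))
print(max_gap)
```

The two vectors agree up to floating-point error because dropping entries from a softmax only changes the normalizing constant.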

Figures (9)

  • Figure 1: Needle-in-a-Haystack test (integrated into SnapKV).
  • Figure 2: Needle-in-a-Haystack test (integrated into AdaKV).
  • Figure 3: Overview of LongBench (integrated into SnapKV).
  • Figure 4: Overview of LongBench (integrated into AdaKV).
  • Figure 5: Perturbation reduction across heads (Llama).
  • ...and 4 more figures

Theorems & Definitions (9)

  • Definition 3.1: Critical KV Cache Identification Problem
  • Theorem 3.2
  • proof
  • Theorem 3.3
  • proof
  • Theorem 3.5
  • proof
  • Theorem 5.1
  • proof