Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou

TL;DR

Ada-KV identifies uniform per-head budgets in KV-cache eviction as a bottleneck for long-sequence LLM inference. It provides a theoretical bound on eviction loss and introduces a head-wise adaptive budget allocator that minimizes this bound, enabling plug-and-play enhancements to existing Top-k eviction methods like SnapKV and Pyramid. Empirically, Ada-KV consistently improves generation quality across 29 datasets in Ruler and LongBench under both question-aware and question-agnostic settings, while maintaining efficient computation via optimized kernels and FlashAttention integration. The approach demonstrates strong practical impact for scalable long-context inference and offers broad applicability to related cache-optimization strategies.

Abstract

Large Language Models have excelled in various domains but face efficiency challenges due to the growing Key-Value (KV) cache required for long-sequence inference. Recent efforts aim to reduce KV cache size by evicting large numbers of non-critical cache elements at runtime while preserving generation quality. However, these methods typically allocate compression budgets uniformly across all attention heads, ignoring the unique attention patterns of each head. In this paper, we establish a theoretical upper bound on the loss between pre- and post-eviction attention outputs, which explains the optimization target of prior cache eviction methods and guides the optimization of adaptive budget allocation. Based on this, we propose Ada-KV, the first head-wise adaptive budget allocation strategy. It offers plug-and-play benefits, enabling seamless integration with prior cache eviction methods. Extensive evaluations on 13 datasets from Ruler and 16 datasets from LongBench, all conducted under both question-aware and question-agnostic scenarios, demonstrate substantial quality improvements over existing methods. Our code is available at https://github.com/FFY0/AdaKV.
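
To make the contrast between uniform and head-wise adaptive budgets concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes one simple way to allocate adaptively: pool the attention weights of all heads in a layer, keep the globally largest entries, and count how many land in each head. The function names (`uniform_budgets`, `adaptive_budgets`, `retained_weight`) and the toy attention matrix are hypothetical and chosen only for illustration.

```python
import numpy as np

def uniform_budgets(num_heads, total_budget):
    """Give every head the same share of the total cache budget."""
    return np.full(num_heads, total_budget // num_heads, dtype=int)

def adaptive_budgets(attn, total_budget):
    """Assumed allocation rule: pool weights across heads, take the global top entries.

    attn has shape (num_heads, seq_len); each row holds one head's attention weights.
    Returns the number of cache entries each head gets to keep.
    """
    num_heads, seq_len = attn.shape
    flat = attn.reshape(-1)
    # Indices of the total_budget largest weights across all heads.
    top = np.argpartition(flat, -total_budget)[-total_budget:]
    # Map flat indices back to head ids and count selections per head.
    return np.bincount(top // seq_len, minlength=num_heads)

def retained_weight(attn, budgets):
    """Sum of attention weight kept when each head keeps its top-budget entries."""
    sorted_desc = -np.sort(-attn, axis=1)
    return float(sum(sorted_desc[h, :b].sum() for h, b in enumerate(budgets)))

# Toy layer: head 0 is concentrated, head 1 is dispersed.
attn = np.array([
    [0.97, 0.01, 0.01, 0.01],
    [0.30, 0.28, 0.22, 0.20],
])
total = 4
print(uniform_budgets(2, total), retained_weight(attn, uniform_budgets(2, total)))
print(adaptive_budgets(attn, total), retained_weight(attn, adaptive_budgets(attn, total)))
```

In this toy case the adaptive rule moves budget from the concentrated head to the dispersed one, so the aggregated retained attention weight is higher than under the uniform split at the same total budget.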

Paper Structure (29 sections, 6 theorems, 28 equations, 12 figures, 18 tables, 2 algorithms)

This paper contains 29 sections, 6 theorems, 28 equations, 12 figures, 18 tables, 2 algorithms.

Key Result

Theorem 3.1

The $L_1$ eviction loss can be bounded by $\epsilon$, where $C=\max\left\{\lVert V_iW_i^O\rVert_{\infty}\right\}$ is a constant representing the maximum row norm.

Figures (12)

  • Figure 1: Adaptive budget allocation accommodates varying attention concentration across heads. Left: Analysis using Llama-3.1-8B-Instruct shows most heads retain nearly all attention weights with a small cache (e.g., top 5%), while dispersed heads require larger cache proportions. Right: Adaptive allocation, which shifts budgets from sparse to dispersed heads, increases the aggregated retained attention weights (from 2.26 to 2.48) and reduces eviction loss compared to uniform allocation.
  • Figure 2: Ada-SnapKV/Ada-Pyramid in One Layer
  • Figure 2: Task Analysis for Llama-3.1-70B (Question-agnostic).
  • Figure 3: Average Score on Ruler Among 13 Datasets.
  • Figure 4: Subtask Analysis on Ruler (Question-agnostic, Llama-3.1-8B-Instruct).
  • ...and 7 more figures

Theorems & Definitions (9)

  • Theorem 3.1
  • Theorem 3.2
  • Theorem 3.3
  • Theorem A.1
  • proof
  • Theorem A.2
  • proof
  • Theorem A.3
  • proof