CAOTE: KV Cache Selection for LLMs via Attention Output Error-Based Token Eviction

Raghavv Goel, Junyoung Park, Mukul Gagrani, Dalton Jones, Matthew Morse, Harper Langston, Mingu Lee, Chris Lott

TL;DR

CAOTE introduces a novel, closed-form eviction criterion for KV cache management in long-context LLMs by minimizing eviction-induced changes to the attention output through a value-vector–aware score: $c_j=\frac{\alpha_j}{1-\alpha_j}\| VA^{T}-v_j\|_2$. This value-centric metric can be combined with existing score-based eviction strategies (meta-eviction) and extended via normalization (H2O) or a fast approximation (FastCAOTE) to reduce computation. Empirical results across LongBench, Needle-in-Haystack, and perplexity benchmarks on Llama3 and Qwen2.5 show consistent gains in accuracy and perplexity when CAOTE (and its Fast variant) is used alongside state-of-the-art eviction methods. The approach offers a practical, model-agnostic augmentation to KV-cache eviction with low overhead, enabling more efficient long-context inference on resource-constrained devices.
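
The score above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes a single attention head, treats $VA^{T}$ as the attention output $\sum_i \alpha_i v_i$, and assumes the scores $\alpha$ sum to 1 with every entry below 1. The function names and the toy numbers are hypothetical.

```python
import math

def attention_output(alpha, values):
    """Weighted sum of value vectors: sum_i alpha_i * v_i."""
    dim = len(values[0])
    return [sum(a * v[k] for a, v in zip(alpha, values)) for k in range(dim)]

def caote_scores(alpha, values):
    """Closed-form score c_j = alpha_j / (1 - alpha_j) * ||out - v_j||_2.

    alpha:  scores assumed to sum to 1, with every entry < 1.
    values: one value vector (list of floats) per cached token.
    """
    out = attention_output(alpha, values)
    return [a / (1.0 - a) * math.dist(out, v) for a, v in zip(alpha, values)]

def brute_force_errors(alpha, values):
    """Reference: drop token j, renormalize the rest, measure the output change."""
    out = attention_output(alpha, values)
    errors = []
    for j in range(len(alpha)):
        kept = [i for i in range(len(alpha)) if i != j]
        total = sum(alpha[i] for i in kept)
        new_alpha = [alpha[i] / total for i in kept]
        new_out = attention_output(new_alpha, [values[i] for i in kept])
        errors.append(math.dist(out, new_out))
    return errors

# Toy cache of four tokens with 2-dimensional values (made-up numbers).
alpha = [0.5, 0.3, 0.12, 0.08]
values = [[1.0, 0.0], [0.9, 0.1], [0.7, 0.3], [-2.0, 3.0]]

scores = caote_scores(alpha, values)
evict_index = min(range(len(scores)), key=scores.__getitem__)
```

In this toy example the lowest attention score belongs to token 3, but its value vector lies far from the attention output, so removing it would change the output noticeably. The value-aware score instead selects token 2, whose value vector nearly coincides with the output.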

Abstract

While long-context support has extended the abilities of large language models, it also incurs memory and compute challenges, which become crucial bottlenecks on resource-restricted devices. Token eviction, a widely adopted post-training methodology designed to alleviate these bottlenecks by evicting less important tokens from the cache, typically uses attention scores as proxy metrics for token importance. However, one major limitation of the attention score as a token-wise importance metric is that it lacks information about the contribution of tokens to the attention output. In this paper, we propose a simple eviction criterion based on the contribution of cached tokens to attention outputs. Our method, CAOTE, optimizes for the error due to token eviction by seamlessly integrating attention scores and value vectors. This is the first method that uses value tokens on top of attention-based eviction scores in closed form. Additionally, CAOTE can act as a meta-heuristic that can be used flexibly with any token eviction method. We show that CAOTE, when combined with state-of-the-art attention score-based methods, always improves accuracy on downstream tasks, indicating the importance of leveraging information from values during the token eviction process.

Paper Structure

This paper contains 34 sections, 3 theorems, 33 equations, 2 figures, 14 tables.

Key Result

Theorem 3.1

Suppose a new input token exceeds the budget $b$ by $1$, so one token must be evicted. For any evicted token $j$, let $\alpha_{i}$ and $\alpha_{i}'$ denote the pre-eviction and post-eviction retention scores of any token $i \ne j$. Then the following relation holds:

$$\alpha_{i}' = \frac{\alpha_{i}}{1-\alpha_{j}}$$
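
The renormalization $\alpha_{i}' = \alpha_{i}/(1-\alpha_{j})$ can be checked numerically. The sketch below assumes that retention scores are a softmax over per-token logits, which is one common way to obtain scores that sum to 1. The logits, the evicted index, and the helper name `softmax` are illustrative and not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a budget of b = 4 tokens plus one new token.
logits = [2.0, 0.5, -1.0, 1.2, 0.1]
alpha = softmax(logits)                          # pre-eviction scores

j = 2                                            # index of the evicted token
kept = [i for i in range(len(logits)) if i != j]
alpha_post = softmax([logits[i] for i in kept])  # recomputed without token j

# Post-eviction scores predicted from the pre-eviction scores alone.
alpha_pred = [alpha[i] / (1.0 - alpha[j]) for i in kept]
max_gap = max(abs(p - q) for p, q in zip(alpha_post, alpha_pred))
```

The two lists agree up to floating-point error, so the post-eviction scores are available without recomputing the softmax. This is what allows the eviction error to be written in closed form.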

Figures (2)

  • Figure 1: General flow of cache eviction when CAOTE is integrated with existing cache eviction methods. We compute the impact of removing each token on the attention output, which is the eviction error (the CAOTE score: $c_{1}, c_{2}, \dots, c_{n+1}$). The token with the least impact is removed.
  • Figure 2: Needle-In-A-Haystack accuracies of Llama 3.1-8B-Instruct under token eviction with a $6$k cache budget.

Theorems & Definitions (6)

  • Theorem 3.1
  • proof
  • Theorem 3.2
  • proof
  • Theorem 4.1
  • proof