Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction

Ziyao Tang, Pengkun Jiao, Xinhang Chen, Wei Liu, Shiyong Li, Jingjing Chen

TL;DR

The paper tackles the KV cache eviction bottleneck in long-context LLMs by arguing that token importance should be judged by long-horizon utility rather than instantaneous attention magnitudes. It introduces LU-KV, a framework that optimizes a global budget distribution across attention heads via a convex-hull relaxation and a marginal-utility greedy solver, with an offline profiling pipeline to enable zero-overhead online deployment. Key contributions include formalizing Oracle Importance, decomposing eviction loss into an optimality gap, and demonstrating substantial reductions in KV cache size (≈80%) with minimal performance degradation on LongBench and RULER. The work provides practical, metric-universal budgets that improve robustness across models and tasks while reducing latency and GPU memory footprint in long-context generation scenarios.

Abstract

Given the quadratic complexity of attention, KV cache eviction is vital to accelerate model inference. Current KV cache eviction methods typically rely on instantaneous heuristic metrics, implicitly assuming that score magnitudes are consistent proxies for importance across all heads. However, this overlooks the heterogeneity in predictive fidelity across attention heads. While certain heads prioritize the instantaneous contribution of tokens, others are dedicated to capturing long-horizon utility. In this paper, we propose that optimal budget allocation should be governed by the marginal utility in preserving long-term semantic information. Based on this insight, we propose LU-KV, a novel framework that optimizes head-level budget allocation through a convex-hull relaxation and a marginal-utility-based greedy solver to achieve near-optimal precision. Furthermore, we implement a data-driven offline profiling protocol to facilitate the practical deployment of LU-KV. Extensive evaluations on LongBench and RULER benchmarks demonstrate that LU-KV achieves an 80% reduction in KV cache size with minimal performance degradation, while simultaneously reducing inference latency and GPU memory footprint.
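
The head-level allocation idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes each head comes with a hypothetical profiled loss curve `loss_curves[h][b]`, the eviction loss when head `h` keeps `b` budget units. It takes the lower convex hull of each curve as the relaxation, then repeatedly gives budget to the hull segment with the largest loss reduction per unit. The function names, the curve format, and the rule of skipping a segment that no longer fits are all assumptions made for illustration.

```python
import heapq


def lower_convex_hull(losses):
    """Indices b whose points (b, losses[b]) lie on the lower convex hull."""
    hull = []
    for b, y in enumerate(losses):
        while len(hull) >= 2:
            b1, b2 = hull[-2], hull[-1]
            # Drop b2 if the slope into it is steeper upward than the slope
            # out of it, i.e. b2 sits strictly above the chord b1 -> b.
            if (losses[b2] - losses[b1]) * (b - b2) > (y - losses[b2]) * (b2 - b1):
                hull.pop()
            else:
                break
        hull.append(b)
    return hull


def allocate_budgets(loss_curves, total_budget):
    """Greedy marginal-utility allocation over convexified per-head curves."""
    hulls = [lower_convex_hull(curve) for curve in loss_curves]
    position = [0] * len(loss_curves)   # current hull vertex per head
    budgets = [0] * len(loss_curves)    # budget units granted per head
    heap = []                           # max-heap via negated gains

    def push_next_segment(h):
        i = position[h]
        if i + 1 < len(hulls[h]):
            b0, b1 = hulls[h][i], hulls[h][i + 1]
            gain = (loss_curves[h][b0] - loss_curves[h][b1]) / (b1 - b0)
            if gain > 0:
                heapq.heappush(heap, (-gain, h))

    for h in range(len(loss_curves)):
        push_next_segment(h)

    remaining = total_budget
    while heap and remaining > 0:
        _, h = heapq.heappop(heap)
        b0, b1 = hulls[h][position[h]], hulls[h][position[h] + 1]
        cost = b1 - b0
        if cost > remaining:
            continue  # assumption: skip segments that no longer fit
        budgets[h] = b1
        remaining -= cost
        position[h] += 1
        push_next_segment(h)
    return budgets


# Three toy heads; the third curve is non-convex, so its point at b=1 is
# removed by the hull and the head receives budget in a step of two units.
curves = [[10, 4, 2, 1], [10, 9, 8, 7], [6, 5, 1, 0]]
print(allocate_budgets(curves, total_budget=4))
```

On the hull, per-unit gains never increase along a head's curve, so taking the steepest remaining segment first is the natural greedy rule for the relaxed problem. In the toy example, the head with the nearly flat curve receives no budget at all.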

Paper Structure (46 sections, 16 equations, 9 figures, 8 tables)

Figures (9)

  • Figure 1: Recall of oracle importance for the oracle metric and several heuristic metrics across varying compression ratios ($\sigma$), where $\sigma = 1$ denotes full compression and $\sigma = 0$ denotes no compression.
  • Figure 2: (a) Comparison between our greedy solver based on convex-hull relaxation (solving Eq. \ref{eq:combinatorial_optimization_relaxd}) and the DP solution (solving Eq. \ref{eq:combinatorial_optimization}). (b) The optimal local compression ratio follows a consistent trend across different downstream tasks under the same global compression ratio $\sigma$.
  • Figure 3: Comparison of aggregated layer-wise eviction loss. Our method consistently achieves the lowest and most stable loss across all layers, whereas baselines such as AdaKV and PyramidKV exhibit severe loss spikes.
  • Figure 4: Heatmap visualization of the per-head loss distribution $\mathcal{L}_{\ell,h}$. Baselines suffer from intense "loss bursts" (dark red blocks) in specific heads due to the optimality gap, while our method effectively suppresses these spikes across the entire model.
  • Figure 5: Efficiency comparison on Llama-3.1-8b. Our method maintains latency comparable to the baselines while significantly reducing memory usage in long-context scenarios.
  • ...and 4 more figures
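
To make the compression-ratio convention in the Figure 1 caption concrete, here is a minimal sketch. It assumes that $\sigma$ is the fraction of tokens evicted, and that recall is the share of the oracle's top-$k$ tokens that a metric also keeps. The paper's exact recall definition is not given in this summary, so the top-$k$ overlap form and the names `kept_token_count` and `oracle_recall` are illustrative assumptions.

```python
def kept_token_count(num_tokens, sigma):
    """Tokens kept at compression ratio sigma (0 = keep all, 1 = evict all)."""
    if not 0.0 <= sigma <= 1.0:
        raise ValueError("sigma must lie in [0, 1]")
    return round((1.0 - sigma) * num_tokens)


def oracle_recall(metric_scores, oracle_scores, sigma):
    """Fraction of the oracle's top-k tokens also kept by the metric (assumed form)."""
    n = len(oracle_scores)
    k = kept_token_count(n, sigma)
    if k == 0:
        return 0.0  # nothing is kept under full compression

    def top_k(scores):
        order = sorted(range(n), key=lambda i: scores[i], reverse=True)
        return set(order[:k])

    return len(top_k(metric_scores) & top_k(oracle_scores)) / k


oracle = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.5, 0.0]
heuristic = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0]
print(kept_token_count(10, 0.8))              # 2 of 10 tokens survive
print(oracle_recall(heuristic, oracle, 0.8))  # 0.5
print(oracle_recall(oracle, oracle, 0.8))     # 1.0
```

Under this reading, the oracle metric scores a recall of 1 against itself at every $\sigma$ below 1, and a heuristic's recall falls as its ranking diverges from the oracle's.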