CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences

Ziran Qin, Yuchen Cao, Mingbao Lin, Wen Hu, Shixuan Fan, Ke Cheng, Weiyao Lin, Jianguo Li

TL;DR

CAKE is a cascading, adaptive KV cache eviction framework for long-context LLMs. It quantifies per-layer attention dynamics through spatial dispersion and temporal shift, allocates memory to layers adaptively via a preference score, and uses cascading prefilling to bound memory usage. An attention-shift-tolerant eviction indicator preserves tokens with sustained importance and low volatility. On LongBench and NeedleBench, CAKE outperforms baselines and matches full-cache performance under modest budgets, with substantial reductions in memory and decoding latency. The approach is compatible with existing eviction methods and generalizes across architectures and tasks, making it practical for memory-constrained inference at scale.

Abstract

Large language models (LLMs) excel at processing long sequences, boosting demand for key-value (KV) caching. While recent efforts to evict KV cache have alleviated the inference burden, they often fail to allocate resources rationally across layers with different attention patterns. In this paper, we introduce Cascading and Adaptive KV cache Eviction (CAKE), a novel approach that frames KV cache eviction as a "cake-slicing problem." CAKE assesses layer-specific preferences by considering attention dynamics in both spatial and temporal dimensions, allocates rational cache size for layers accordingly, and manages memory constraints in a cascading manner. This approach enables a global view of cache allocation, adaptively distributing resources across diverse attention mechanisms while maintaining memory budgets. CAKE also employs a new eviction indicator that considers the shifting importance of tokens over time, addressing limitations in existing methods that overlook temporal dynamics. Comprehensive experiments on LongBench and NeedleBench show that CAKE maintains model performance with only 3.2% of the KV cache and consistently outperforms current baselines across various models and memory constraints, particularly in low-memory settings. Additionally, CAKE achieves over 10x speedup in decoding latency compared to full cache when processing contexts of 128K tokens with FlashAttention-2. Our code is available at https://github.com/antgroup/cakekv.

Paper Structure

This paper contains 30 sections, 2 theorems, 22 equations, 15 figures, 14 tables.

Key Result

Proposition 1

For any layer $l\in [L]$, the allocated budget size $B_l^{(m)}$ decreases monotonically as the stage $m$ advances from $l$ to $L-1$, where $B_l^{(m)}$ is the redistributed cache budget for layer $l$ at stage $m$, calculated from the preference scores $\mathcal{P}$ obtained so far.
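The monotonic behaviour in Proposition 1 can be illustrated with a small sketch. It assumes, purely for illustration, that at stage $m$ the total budget is split among the layers seen so far in proportion to their preference scores; the exact redistribution formula is not reproduced in this listing, so the proportional rule, the function name `cascade_budgets`, and the sample scores are all hypothetical.

```python
def cascade_budgets(preferences, total_budget):
    """Return stages[m][l]: the budget of layer l after stage m.

    Assumption (not taken from the paper text above): at stage m the
    total budget is shared among layers 0..m in proportion to their
    preference scores, which must be positive.
    """
    stages = []
    for m in range(len(preferences)):
        seen = preferences[: m + 1]   # scores of layers prefilled so far
        norm = sum(seen)              # normaliser grows as layers are added
        stages.append([total_budget * p / norm for p in seen])
    return stages


# Hypothetical preference scores for a 4-layer model.
prefs = [2.0, 1.0, 1.0, 4.0]
stages = cascade_budgets(prefs, total_budget=1024)

for m, budgets in enumerate(stages):
    print(f"stage {m}: {[round(b, 1) for b in budgets]}")
```

Under this assumption, every stage uses exactly the total budget, and each layer's share can only shrink as later layers join the normaliser, which is the property the proposition states.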

Figures (15)

  • Figure 1: Variation in spatial (a, b) and temporal (c, d) characteristics of attention patterns. We provide toy examples (left) and real examples from Mistral's different layers (right) for illustration. For more detailed analysis and visualization of attention dynamics, please refer to the appendix.
  • Figure 2: Illustration of CAKE compared with existing cache allocation strategies. (a) Uniform cache allocation (xiao2023efficient, zhang2024h2o, li2024snapkv); (b) Fixed-shape cache allocation (cai2024pyramidkv, yang2024pyramidinfer); (c) Preference-prioritized adaptive cache allocation used in CAKE. Compared to (a) and (b), CAKE adjusts allocation ratios across different layers with layer preferences, adapting to various contexts and models, with given memory budgets.
  • Figure 3: Analysis of attention dynamics. (a) Heatmap of spatial attention dispersion (upper) and temporal attention shift (lower); red indicates high values and blue indicates low values. The x-axis represents samples, and the y-axis represents layers. (b) Variation of spatial attention dispersion (upper) and temporal attention shift (lower) across layers and models. The experimental data is derived from the LongBench dataset (bai2023longbench).
  • Figure 5: Average score among 16 datasets of LongBench under different cache budgets.
  • Figure 6: Performance comparison on NeedleBench 32K.
  • ...and 10 more figures

Theorems & Definitions (4)

  • Proposition 1
  • Theorem 1
  • Proof of Proposition 1
  • Proof of Theorem 1