LazyEviction: Lagged KV Eviction with Attention Pattern Observation for Efficient Long Reasoning

Haoyue Zhang, Hualei Zhang, Xiaosong Ma, Jie Zhang, Song Guo

TL;DR

Long reasoning with LLMs incurs substantial KV cache memory costs, which grow with the length of the chain of thought (CoT). The authors identify Token Importance Recurrence (TIR), in which tokens intermittently regain high attention, and propose LazyEviction, a lagged eviction strategy that uses an observation window and recurrence-aware MRI scoring to preserve latent recurring tokens. MRI-driven eviction maintains near-FullKV accuracy while reducing the KV budget by 50%–70% across multiple models and domains, outperforming existing KV compression baselines. The approach targets memory-limited inference, where the KV cache of long reasoning traces is the main bottleneck.

Abstract

Large Language Models (LLMs) exhibit enhanced capabilities through Chain-of-Thought reasoning. However, the extended reasoning sequences introduce significant GPU memory overhead because the key-value (KV) cache grows with sequence length. Existing KV cache compression methods mitigate memory bottlenecks but struggle in long reasoning tasks. In this paper, we analyze attention patterns in reasoning tasks and reveal a Token Importance Recurrence phenomenon: a large proportion of tokens regain high attention after multiple decoding steps. Existing works fail to capture this pattern, which may lead to unpredictable eviction of such periodically critical tokens. To address this, we propose LazyEviction, an observation window-based lagged eviction framework that retains latent recurring tokens by prioritizing eviction according to tokens' recurrence patterns. Extensive experiments demonstrate that LazyEviction reduces the KV cache by 50%–70% while maintaining comparable accuracy, outperforming existing KV cache compression baselines. Our implementation code can be found at https://github.com/Halo-949/LazyEviction.
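To make the recurrence idea concrete, the sketch below measures the gaps between the decoding steps at which a single token receives high attention. It is an illustration only, not the authors' code: the function names, the fixed importance threshold, and the reading of "MRI" as the largest such gap are assumptions, since this excerpt does not define them.

```python
from typing import List


def recurrence_intervals(attn: List[float], threshold: float) -> List[int]:
    """Gaps between consecutive decoding steps where attention >= threshold.

    `attn[t]` is the attention one cached token receives at decoding step t.
    The fixed `threshold` is an assumption made for this illustration.
    """
    important_steps = [t for t, a in enumerate(attn) if a >= threshold]
    return [later - earlier
            for earlier, later in zip(important_steps, important_steps[1:])]


def max_recurrence_interval(attn: List[float], threshold: float) -> int:
    """Largest gap between important steps; 0 if the token never recurs."""
    return max(recurrence_intervals(attn, threshold), default=0)


# A token that is important at steps 0, 4 and 6, and ignored in between.
trace = [0.30, 0.02, 0.01, 0.01, 0.25, 0.02, 0.28, 0.01]
gaps = recurrence_intervals(trace, threshold=0.2)
longest = max_recurrence_interval(trace, threshold=0.2)
```

A policy that looks only at the current attention score would be free to drop this token during steps 1 to 3, even though it becomes important again at step 4.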

Paper Structure

This paper contains 33 sections, 5 equations, 6 figures, 10 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison of different KV eviction methods. Dark squares indicate tokens with higher attention scores. (a) Current attention-based eviction evicts at every step using immediate attention scores. (b) Cumulative attention-based eviction integrates historical attention into eviction decisions. Both (a) and (b) fail to preserve recurring tokens during their low-attention intervals. (c) LazyEviction performs lagged KV evictions based on an observation window to detect latent recurring tokens and avoid discarding them prematurely.
  • Figure 2: (a) Performance degradation of SOTA methods on reasoning tasks. At the same KV cache compression ratio (e.g., 50%), the performance of both H2O and TOVA on the GSM8K dataset is 20% lower than on the traditional language modeling dataset PG-19 (Rae et al., 2019). (b) Visualization of importance variation, obtained by selecting the Top-50% important tokens. Tokens at the same position show different importance at different decoding steps.
  • Figure 3: Visualization of TIR. We observe attention maps across different heads of DeepSeek-R1-Distill-Llama-8B and find that most tokens (>95%) show the TIR pattern. (a) and (b) show an attention map with recurring tokens and their corresponding positions. (c) shows the MRI distribution for different models across different tasks.
  • Figure 4: Overview of the proposed LazyEviction framework. Dark squares indicate tokens with relatively higher attention scores. (a) LazyEviction makes eviction decisions at intervals of $W$ steps. The workflow contains two key operations: (b) Dynamic MRI Tracking according to updated important timestamps, and (c) MRI-Centric Scoring during decision phases, where tokens predicted to be critical for future steps are retained. (d) The MRI-Centric Score predicts future token importance by analyzing historical patterns of importance variation (i.e., each token's MRI and the time elapsed since its latest important timestamp).
  • Figure 5: Trade-off between accuracy and KV Cache among different datasets and models.
  • ...and 1 more figure
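The workflow described in the Figure 4 caption (eviction decisions every $W$ steps, MRI tracking from important timestamps, and a score built from MRI and elapsed time) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and method names, the fixed importance threshold, and in particular the scoring formula `(mri + 1) / (elapsed + 1)` are hypothetical stand-ins, because the excerpt does not give the actual MRI-Centric Score.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TokenState:
    last_important: int  # latest step at which the token was important
    mri: int = 0         # largest gap observed between important steps


class LaggedEvictor:
    """Toy lagged eviction: track recurrence each step, evict every `window` steps."""

    def __init__(self, budget: int, window: int, threshold: float):
        self.budget = budget        # maximum number of cached tokens after eviction
        self.window = window        # W: steps between eviction decisions
        self.threshold = threshold  # assumed cutoff for "high attention"
        self.tokens: Dict[int, TokenState] = {}
        self.step = 0

    def observe(self, new_token: int, attn: Dict[int, float]) -> List[int]:
        """Record one decoding step and return the token ids evicted at it."""
        self.step += 1
        # Dynamic tracking: update timestamps and gaps of tokens attended now.
        for tok, weight in attn.items():
            state = self.tokens.get(tok)
            if state is not None and weight >= self.threshold:
                state.mri = max(state.mri, self.step - state.last_important)
                state.last_important = self.step
        self.tokens[new_token] = TokenState(last_important=self.step)
        # Lagged decision: only act at window boundaries and when over budget.
        if self.step % self.window != 0 or len(self.tokens) <= self.budget:
            return []
        return self._evict()

    def _score(self, state: TokenState) -> float:
        # Hypothetical score: long observed gaps and recent importance both
        # raise the chance that a token is kept.
        elapsed = self.step - state.last_important
        return (state.mri + 1) / (elapsed + 1)

    def _evict(self) -> List[int]:
        ranked = sorted(self.tokens, key=lambda t: self._score(self.tokens[t]))
        victims = ranked[: len(self.tokens) - self.budget]
        for tok in victims:
            del self.tokens[tok]
        return victims


evictor = LaggedEvictor(budget=3, window=4, threshold=0.2)
evicted_per_step = [
    evictor.observe(0, {}),
    evictor.observe(1, {0: 0.05}),
    evictor.observe(2, {0: 0.05, 1: 0.05}),
    evictor.observe(3, {0: 0.50, 1: 0.05, 2: 0.05}),  # token 0 recurs here
]
```

In this toy run nothing is evicted during the first three steps, so token 0 survives its low-attention interval and is still cached when it regains attention at step 4. A stepwise policy with the same budget would have had to pick a victim earlier, before that recurrence was visible.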