LazyEviction: Lagged KV Eviction with Attention Pattern Observation for Efficient Long Reasoning
Haoyue Zhang, Hualei Zhang, Xiaosong Ma, Jie Zhang, Song Guo
TL;DR
Long reasoning with LLMs incurs substantial KV cache memory costs, especially as CoT lengths grow. The authors identify Token Importance Recurrence (TIR), in which tokens intermittently regain high attention after several decoding steps, and propose LazyEviction, a lagged eviction strategy that uses an observation window and recurrence-aware MRI scoring to preserve latent recurring tokens. MRI-driven eviction maintains near-FullKV accuracy while reducing the KV budget by 50%–70% across multiple models and domains, outperforming existing KV compression baselines. The approach makes long reasoning more practical for memory-limited inference.
Abstract
Large Language Models (LLMs) exhibit enhanced capabilities through Chain-of-Thought (CoT) reasoning. However, the extended reasoning sequences introduce significant GPU memory overhead because the key-value (KV) cache grows with them. Existing KV cache compression methods mitigate memory bottlenecks but struggle on long reasoning tasks. In this paper, we analyze attention patterns in reasoning tasks and reveal a Token Importance Recurrence phenomenon: a large proportion of tokens regain high attention after multiple decoding steps. Existing works fail to capture this phenomenon and may therefore unpredictably evict such periodically critical tokens. To address this, we propose LazyEviction, an observation-window-based lagged eviction framework that retains latent recurring tokens by prioritizing eviction according to tokens' recurrence patterns. Extensive experiments demonstrate that LazyEviction reduces the KV cache by 50%–70% while maintaining comparable accuracy, outperforming existing KV cache compression baselines. Our implementation code can be found at https://github.com/Halo-949/LazyEviction.
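To make the idea of lagged, recurrence-aware eviction concrete, the following is a minimal toy sketch and not the authors' implementation (see the repository linked above for that). Everything in it is an assumption made for illustration: the class and field names, the fixed attention threshold that marks a token as "important", and the scoring rule, which evicts first the tokens whose time since last importance most exceeds the longest recurrence gap they have shown so far. It tracks token ids only, not actual key/value tensors.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TokenStats:
    last_important: int    # last decoding step at which the token was important
    max_interval: int = 0  # longest gap observed between two important steps


class LaggedEvictionCache:
    """Toy index of cached tokens; evicts only at observation-window boundaries.

    Hypothetical illustration: the threshold rule and the 'overdue' score
    are assumptions, not the scoring used in the paper.
    """

    def __init__(self, budget: int, window: int, threshold: float):
        self.budget = budget        # maximum tokens kept after an eviction
        self.window = window        # observation window length in steps
        self.threshold = threshold  # attention weight counted as "important"
        self.step = 0
        self.stats: dict[int, TokenStats] = {}

    def observe(self, new_token: int, attention: dict[int, float]) -> list[int]:
        """Record one decoding step and return the ids evicted at this step."""
        self.step += 1
        for tok, weight in attention.items():
            s = self.stats.get(tok)
            if s is not None and weight >= self.threshold:
                # The token regained high attention: update its recurrence gap.
                s.max_interval = max(s.max_interval, self.step - s.last_important)
                s.last_important = self.step
        self.stats[new_token] = TokenStats(last_important=self.step)
        # Lagged eviction: do nothing until the observation window closes.
        if self.step % self.window != 0:
            return []
        return self._evict()

    def _evict(self) -> list[int]:
        excess = len(self.stats) - self.budget
        if excess <= 0:
            return []

        def overdue(tok: int) -> int:
            # How far the token is past its own longest recurrence gap.
            s = self.stats[tok]
            return (self.step - s.last_important) - s.max_interval

        victims = sorted(self.stats, key=overdue, reverse=True)[:excess]
        for tok in victims:
            del self.stats[tok]
        return sorted(victims)


# Usage: token 0 regains high attention at the 4th step, so it survives
# eviction even though it is the oldest token in the cache.
cache = LaggedEvictionCache(budget=3, window=6, threshold=0.5)
evicted = []
for i in range(6):
    attention = {0: 0.9} if i == 3 else {}
    evicted = cache.observe(new_token=i, attention=attention)
print(evicted)              # [1, 2, 3]
print(sorted(cache.stats))  # [0, 4, 5]
```

In this sketch, a policy based purely on token age would have dropped token 0 first. Deferring the decision until the window closes gives the cache time to observe that token 0 recurs, which is the intuition behind retaining "latent recurring tokens".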
