SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models

Jiayi Tian, Seyedarmin Azizi, Yequan Zhao, Erfan Baghaei Potraghloo, Sean McPherson, Sharath Nittur Sridhar, Zhengyang Wang, Zheng Zhang, Massoud Pedram, Souvik Kundu

TL;DR

SkipKV tackles KV cache growth in large reasoning models with a training-free, sentence-level eviction policy and adaptive steering that suppresses redundant reasoning steps. The method detects sentence-level redundancy via a cumulative score combining token importance, token redundancy, and sentence similarity, and couples it with a batch-grouping strategy to reduce padding overhead. Across multiple LRMs and reasoning benchmarks, SkipKV achieves up to 26.7% higher accuracy than alternative eviction methods and up to 1.7x higher throughput at similar KV budgets, while generating substantially shorter outputs than token-level eviction baselines. The results suggest that managing reasoning traces at the sentence level, rather than token by token, can deliver memory efficiency without retraining or quantization.
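To make the padding overhead concrete, here is a minimal sketch of how batch composition changes the number of pad tokens. It assumes each batch is padded to its longest prompt, and it uses sorting by prompt length as a hypothetical stand-in for a grouping strategy; the summary does not specify how SkipKV forms its groups, so `group_by_length` and the sample lengths are illustrative only.

```python
def padding_tokens(lengths, batch_size):
    """Count pad tokens when every batch is padded to its longest prompt."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total


def group_by_length(lengths):
    """Hypothetical grouping rule (an assumption): batch prompts of similar length."""
    return sorted(lengths)


# Illustrative prompt lengths in tokens, not taken from any benchmark.
prompt_lengths = [40, 300, 55, 280, 60, 310]

naive = padding_tokens(prompt_lengths, batch_size=3)
grouped = padding_tokens(group_by_length(prompt_lengths), batch_size=3)
print(naive, grouped)
```

Under a fixed per-sequence KV budget, every cached pad token takes a slot that a real token could have used, so fewer pad tokens means a larger effective budget.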

Abstract

Large reasoning models (LRMs) often incur significant key-value (KV) cache overhead, because the cache grows linearly with the verbose chain-of-thought (CoT) reasoning process. This creates both memory and throughput bottlenecks that limit efficient deployment. To reduce KV cache size during inference, we first investigate the effectiveness of existing KV cache eviction methods for CoT reasoning. Interestingly, we find that, due to unstable token-wise scoring and the reduced effective KV budget caused by padding tokens, state-of-the-art (SoTA) eviction methods fail to maintain accuracy in the multi-batch setting. Additionally, these methods often generate longer sequences than the original model, as semantics-unaware token-wise eviction leads to repeated revalidation during reasoning. To address these issues, we present SkipKV, a training-free KV compression method for selective eviction and generation that operates at a coarse-grained, sentence-level granularity for efficient CoT reasoning. Specifically, it introduces a sentence-scoring metric to identify and remove highly similar sentences while maintaining semantic coherence. To suppress redundant generation, SkipKV dynamically adjusts a steering vector that updates the hidden activation states during inference, pushing the LRM to generate concise responses. Extensive evaluations on multiple reasoning benchmarks show that SkipKV achieves up to 26.7% higher accuracy than the alternatives at a similar compression budget. Additionally, compared to SoTA, SkipKV reduces generation length by up to 1.6× while improving throughput by up to 1.7×.
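The idea of flagging highly similar sentences can be illustrated with a small sketch. This is not the paper's scoring metric: it covers only a sentence-similarity ingredient, uses bag-of-words cosine similarity as a stand-in for whatever representation the method actually compares, and the threshold of 0.8 and the rule of keeping the most recent copy are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def bag_of_words(sentence):
    """Lowercased word counts; a crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def redundant_sentences(sentences, threshold=0.8):
    """Return indices of sentences that closely match a later sentence.

    Keeping the latest copy is an arbitrary choice for this sketch.
    """
    vecs = [bag_of_words(s) for s in sentences]
    return [
        i for i in range(len(vecs))
        if any(cosine(vecs[i], vecs[j]) >= threshold
               for j in range(i + 1, len(vecs)))
    ]


trace = ("Let me compute 3 times 4. That gives 12. "
         "Wait, let me compute 3 times 4 again. So the answer is 12.")
sentences = split_sentences(trace)
evicted = redundant_sentences(sentences)
print(evicted)
```

Evicting whole sentences keeps the retained tokens contiguous and readable, in contrast to the fragmented selections that the abstract attributes to token-wise eviction.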

Paper Structure

This paper contains 31 sections, 15 equations, 16 figures, 3 tables, 2 algorithms.

Figures (16)

  • Figure 1: Comparison of KV cache eviction methods for a reasoning model. Marker size denotes KV memory usage. SkipKV yields shorter generation length while maintaining high accuracy under a smaller KV budget.
  • Figure 2: Comparison of KV cache eviction methods during token generation. Cached tokens marked with ${\times}$ indicate evicted positions. (a) SnapKV performs one-time eviction after prefill; (b) H2O evicts tokens with low cumulative attention scores; (c) R-KV prunes redundant tokens based on token-level similarity (purple); (d) SkipKV (ours) groups tokens within sentences (green) to evict high sentence-redundancy regions, achieving high accuracy and shorter generation length ($N$).
  • Figure 3: Left: Accuracy comparison for single- and multi-batch decoding of H2O [zhang2023h2o] and R-KV [cai2025r]. Center: Visualization of the prefill token length distribution of MATH-500, and the min-max range of each batch (batch size, bs = 10). Right: Accuracy and generated token length versus KV budget with R-KV eviction on MATH-500 (bs = 10).
  • Figure 4: SoTA token-based eviction [cai2025r] often selects fragmented tokens from the final answer $(6 + 9i)$ (boxed in orange), causing repeated self-validation and redundant non-execution thoughts (highlighted in yellow). Blue tokens are retained, while gray ones are evicted.
  • Figure 5: Statistics on the ratio of high-similarity sentences (top) and non-execution thoughts (bottom) generated for samples that the models answered correctly and incorrectly.
  • ...and 11 more figures