ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models

Akshat Ramachandran, Marina Neseem, Charbel Sakr, Rangharajan Venkatesan, Brucek Khailany, Tushar Krishna

TL;DR

The paper tackles the memory bottleneck that KV-cache growth creates for long-output reasoning in large reasoning models (LRMs). It introduces ThinKV, a thought-adaptive, hybrid quantization–eviction framework that uses attention sparsity to classify chain-of-thought (CoT) segments into reasoning, execution, and transition thoughts. It then applies TBQ and TBE in tandem with a Continuous Thinking kernel that reuses memory without costly compaction. The approach achieves near-lossless accuracy with less than 5% of the original KV cache and up to 5.8x higher throughput across math and coding benchmarks, a favorable trade-off between memory savings and accuracy. This algorithm–system co-design enables scalable, efficient long-output inference on commodity hardware and offers practical guidance for deploying LRMs with aggressive KV-cache compression.
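To make the memory bottleneck concrete, the following back-of-the-envelope sketch computes the KV-cache footprint of a single sequence and the budget implied by the "less than 5%" figure. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit elements) and the function name `kv_cache_bytes` are assumptions chosen for illustration, not values taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to hold keys and values for one sequence.

    The leading factor 2 accounts for storing both K and V.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape (an assumption, not from the paper).
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768)
budget = 0.05 * full  # the "less than 5%" cache budget quoted above

print(f"{full / 2**30:.1f} GiB")    # 4.0 GiB
print(f"{budget / 2**20:.1f} MiB")  # 204.8 MiB
```

The footprint grows linearly with `seq_len`, which is why long reasoning traces, rather than long prompts, dominate memory in this setting.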

Abstract

The long-output context generation of large reasoning models enables extended chain of thought (CoT) but also drives rapid growth of the key-value (KV) cache, quickly overwhelming GPU memory. To address this challenge, we propose ThinKV, a thought-adaptive KV cache compression framework. ThinKV is based on the observation that attention sparsity reveals distinct thought types with varying importance within the CoT. It applies a hybrid quantization-eviction strategy, assigning token precision by thought importance and progressively evicting tokens from less critical thoughts as reasoning trajectories evolve. Furthermore, to implement ThinKV, we design a kernel that extends PagedAttention to enable efficient reuse of evicted tokens' memory slots, eliminating compaction overheads. Extensive experiments on DeepSeek-R1-Distill, GPT-OSS, and NVIDIA AceReason across mathematics and coding benchmarks show that ThinKV achieves near-lossless accuracy with less than 5% of the original KV cache, while improving performance with up to 5.8x higher inference throughput over state-of-the-art baselines.
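The hybrid quantization-eviction idea described in the abstract can be sketched roughly as follows. This is a toy illustration and not the authors' method: the bit widths, keep ratios, the ranking of thought types by importance, the per-token score, and the names `BITS`, `KEEP`, `quantize`, and `compress_segment` are all assumptions made for the example.

```python
import math

# Hypothetical per-thought settings (assumptions, not from the paper).
BITS = {"reasoning": 8, "execution": 4, "transition": 2}
KEEP = {"reasoning": 1.0, "execution": 0.5, "transition": 0.25}


def quantize(values, bits):
    """Symmetric uniform fake-quantization: snap floats to a signed b-bit grid."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / levels or 1.0  # avoid a zero scale
    return [round(v / scale) * scale for v in values]


def compress_segment(tokens, thought):
    """tokens: list of (value, score) pairs for one thought segment.

    Keeps the highest-score fraction of tokens (eviction), preserves their
    original order, then quantizes the survivors at the segment's bit width.
    """
    k = max(1, math.ceil(KEEP[thought] * len(tokens)))
    by_score = sorted(range(len(tokens)), key=lambda i: tokens[i][1], reverse=True)
    keep_idx = sorted(by_score[:k])
    kept = [tokens[i][0] for i in keep_idx]
    return quantize(kept, BITS[thought])


segment = [(0.5, 0.1), (-1.0, 0.9), (0.25, 0.3), (0.75, 0.2)]
for thought in ("reasoning", "execution", "transition"):
    print(thought, compress_segment(segment, thought))
```

In this sketch the same segment survives intact at 8 bits when labeled as reasoning, but shrinks to a single 2-bit token when labeled as a transition, which is the kind of importance-dependent treatment the abstract describes.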

Paper Structure

This paper contains 52 sections, 1 theorem, 13 equations, 16 figures, 10 tables, 1 algorithm.

Key Result

Theorem 1

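Given $q\in\mathbb{R}^{1\times d}$, $K\in\mathbb{R}^{n\times d}$, $V\in\mathbb{R}^{n\times d}$, define $\mathrm{Attn}(q,K,V)=\mathrm{softmax}\!\left(qK^{\top}/\sqrt{d}\right)V$. For any permutation matrix $\Pi\in\mathbb{R}^{n\times n}$, $\mathrm{Attn}(q,\Pi K,\Pi V)=\mathrm{Attn}(q,K,V)$.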
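The property named by Theorem 1 (KV permutation invariance of attention) can be checked numerically. The sketch below assumes standard single-query scaled dot-product attention with a softmax over the $n$ cached positions; the $1/\sqrt{d}$ scaling and the helper name `attention` are assumptions, since the theorem's definition is not fully reproduced in this listing.

```python
import math
import random


def attention(q, K, V):
    """Single-query scaled dot-product attention over the rows of K and V."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / z for j in range(len(V[0]))]


rng = random.Random(0)
n, d = 6, 4
q = [rng.gauss(0, 1) for _ in range(d)]
K = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
V = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]

# Apply the same row permutation to K and V.
perm = list(range(n))
rng.shuffle(perm)
K_p = [K[i] for i in perm]
V_p = [V[i] for i in perm]

out = attention(q, K, V)
out_p = attention(q, K_p, V_p)
max_diff = max(abs(a - b) for a, b in zip(out, out_p))
print(max_diff)  # expected to be at floating-point rounding level
```

The invariance holds because each output coordinate is a sum over key-value pairs, and a sum does not depend on the order of its terms as long as each key stays paired with its own value.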

Figures (16)

  • Figure 1: Illustrative comparison of KV cache compression methods as tokens are generated. (a) Existing techniques and (b) ThinKV (Ours). (c) Accuracy vs. TPOT comparison for GPT-OSS-20B.
  • Figure 1: Comparison of ThinKV with KV quantization baselines.
  • Figure 2: Accuracy-compression tradeoff.
  • Figure 3: Layer-wise attention sparsity across decode steps, aligned with R, T and E thoughts.
  • Figure 4: Counterfactual importance of thought categories.
  • ...and 11 more figures

Theorems & Definitions (7)

  • Definition 1: LRM Thought Decomposition
  • Definition 2: Thought-Adaptive Quantization
  • Definition 3: Thought-Adaptive Eviction
  • Theorem 1: KV Permutation Invariance of Attention
  • proof
  • Remark
  • Remark
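One plausible reading of how Theorem 1 connects to the memory-reuse kernel mentioned in the abstract: if the order of cached KV rows does not affect the attention output, a new token's KV entry can be written into whatever slot an evicted token freed, with no compaction pass. The following is a minimal sketch of such a free-list allocator. The class `SlotPool` and its methods are hypothetical and do not represent the paper's kernel or the PagedAttention API.

```python
class SlotPool:
    """Fixed-size pool of KV slots with a free list (hypothetical helper)."""

    def __init__(self, capacity):
        self.slots = [None] * capacity  # slot index -> (token_id, kv) or None
        # Stack of free indices; pop() hands out the lowest index first.
        self.free = list(range(capacity - 1, -1, -1))

    def write(self, token_id, kv):
        """Store a token's KV entry in any free slot and return the slot index."""
        if not self.free:
            raise MemoryError("no free KV slot; evict a token first")
        slot = self.free.pop()
        self.slots[slot] = (token_id, kv)
        return slot

    def evict(self, slot):
        """Mark a slot as free without moving any other entry."""
        if self.slots[slot] is None:
            raise ValueError(f"slot {slot} is already free")
        self.slots[slot] = None
        self.free.append(slot)

    def live_tokens(self):
        """Token ids in physical slot order, which may differ from generation order."""
        return [entry[0] for entry in self.slots if entry is not None]


pool = SlotPool(capacity=4)
slots = [pool.write(t, f"kv{t}") for t in range(4)]  # fills slots 0..3
pool.evict(1)                                        # token 1 leaves a hole
reused = pool.write(4, "kv4")                        # token 4 lands in that hole
print(reused, pool.live_tokens())                    # 1 [0, 4, 2, 3]
```

Eviction here is a constant-time bookkeeping update, and the resulting physical order `[0, 4, 2, 3]` differs from generation order, which is acceptable only because attention is invariant to a joint permutation of keys and values.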