One Size Does Not Fit All: Token-Wise Adaptive Compression for KV Cache

Liming Lu; Kaixi Qiu; Jiayu Zhou; Jushi Kai; Haoyan Zhang; Huanyu Wang; Jingwen Leng; Ziwei He; Zhouhan Lin

Abstract

Despite the remarkable progress of Large Language Models (LLMs), the escalating memory footprint of the Key-Value (KV) cache remains a critical bottleneck for efficient inference. While dimensionality reduction offers a promising compression avenue, existing approaches typically either necessitate prohibitively expensive pre-training from scratch or suffer from severe performance deterioration under high compression regimes. In this work, we propose DynaKV, a novel post-training framework for low-rank KV cache compression. To the best of our knowledge, DynaKV is the first method to dynamically allocate compression rates to individual tokens according to their semantic meaning, which allows it to achieve better fidelity at aggressive compression ratios. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art compression techniques, achieving significant memory reduction while maintaining competitive generation quality. Furthermore, our approach is orthogonal to sequence-level pruning methods. When integrated with SnapKV, DynaKV retains only 6% of the KV cache while maintaining 94% of the baseline performance on the LongBench benchmark.

Paper Structure (27 sections, 9 equations, 3 figures, 5 tables)

Introduction
Related Work
Training-free Compression Methods
Post-training Compression Methods
Architecture-Intrinsic Methods
Dynamic Compressing
KV Projection to Spectral Space
Differentiable Token-Adaptive Compression
Hard Masking at Inference Time
Retain Rate
Differentiable Masking at Training Time
Training Objective
Experiments
Experimental Settings
Models and Benchmarks
...and 12 more sections

Figures (3)

Figure 1: Illustration of the compression strategies. The diagram compares fixed-ratio compression with our proposed dynamic approach. While traditional methods apply a uniform compression rate to all tokens, DynaKV allocates variable storage budgets, assigning each token its own KV retain rate based on its importance.
Figure 2: Overview of the DynaKV framework. Unlike static methods that use a uniform compression rate, DynaKV employs a token-adaptive masking mechanism to dynamically select and retain critical KV dimensions. This ensures that semantically significant context is preserved while redundancy is minimized, maintaining high performance across both short and long-context tasks.
Figure 3: LongBench average scores under varying KV cache budgets. The plot compares DynaKV with baseline methods (Palu and MatryoshkaKV) across different compression rates.
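
To make the idea of a per-token retain rate concrete, the following is a minimal sketch and not the DynaKV implementation. It assumes that each key vector is projected onto an orthonormal basis and that a token keeps only a leading prefix of its coefficients; the prefix form of the mask, the fixed Hadamard basis, the toy vectors, the retain rates, and all function names are hypothetical choices made for illustration.

```python
import math

# Hypothetical 4x4 orthonormal basis (a scaled Hadamard matrix). It stands in
# for whatever learned or SVD-derived projection a real method would use.
BASIS = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
    [0.5, 0.5, -0.5, -0.5],
    [0.5, -0.5, -0.5, 0.5],
]


def project(vec, basis):
    """Return the coefficients of vec along each basis row."""
    return [sum(b * v for b, v in zip(row, vec)) for row in basis]


def reconstruct(coeffs, basis):
    """Invert project for an orthonormal basis (multiply by the transpose)."""
    dim = len(basis[0])
    return [sum(c * row[j] for c, row in zip(coeffs, basis)) for j in range(dim)]


def compress_token(vec, retain_rate, basis):
    """Store only the leading ceil(retain_rate * d) coefficients of one token."""
    coeffs = project(vec, basis)
    kept = max(1, math.ceil(retain_rate * len(coeffs)))
    return coeffs[:kept]


def decompress_token(stored, basis):
    """Zero-pad the stored coefficients and map back to the original space."""
    padded = stored + [0.0] * (len(basis) - len(stored))
    return reconstruct(padded, basis)


def average_retain_rate(stored_tokens, dim):
    """Fraction of coefficients actually stored across all tokens."""
    return sum(len(s) for s in stored_tokens) / (len(stored_tokens) * dim)


# Three toy key vectors with made-up per-token retain rates.
keys = [[1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0], [1.0, -1.0, 1.0, -1.0]]
rates = [1.0, 0.25, 0.5]

cache = [compress_token(k, r, BASIS) for k, r in zip(keys, rates)]
restored = [decompress_token(s, BASIS) for s in cache]
budget = average_retain_rate(cache, dim=4)
```

In this toy setup the three tokens store 4, 1, and 2 coefficients respectively, so the cache holds 7 of the 12 possible values, whereas a uniform scheme would give every token the same count. The second and third tokens are reconstructed without error only because their energy happens to lie entirely in the retained directions; in general, truncation is lossy.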
