Can LLMs Maintain Fundamental Abilities under KV Cache Compression?

Xiang Liu, Zhenheng Tang, Hong Chen, Peijie Dong, Zeyu Li, Xiuze Zhou, Bo Li, Xuming Hu, Xiaowen Chu

TL;DR

KVFundaBench shows that KV cache compression degrades fundamental LLM abilities in a task-dependent manner, with arithmetic reasoning and long-context generation most affected. The authors propose ShotKV, a compression method that treats the prefill and decoding phases separately and preserves shot-level semantics to maintain reasoning coherence. Across datasets and models, ShotKV yields 9-18% gains on long-context generation under aggressive compression and improves latency and throughput. The work highlights the need for selective, structure-aware compression that preserves critical prompt information and complex reasoning in LLMs.

Abstract

This paper investigates an underexplored challenge in large language models (LLMs): the impact of KV cache compression methods on LLMs' fundamental capabilities. Although existing methods achieve impressive compression ratios on long-context benchmarks, their effects on core model capabilities remain understudied. We present a comprehensive benchmark, KVFundaBench, to systematically evaluate the effects of KV cache compression across diverse fundamental LLM capabilities, spanning world knowledge, commonsense reasoning, arithmetic reasoning, code generation, safety, and long-context understanding and generation. Our analysis reveals several key findings: (1) Task-Dependent Degradation; (2) Model-Type Robustness; (3) Prompt Length Vulnerability; (4) Chunk-Level Superiority; (5) Prompt-Gain Sensitivity; (6) Long-Context Generation Sensitivity. Based on our analysis of attention patterns and cross-task compression performance, we propose ShotKV, a novel compression approach that distinctly handles prefill and decoding phases while maintaining shot-level semantic coherence. Empirical results show that ShotKV achieves 9%-18% performance improvements on long-context generation tasks under aggressive compression ratios.
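
To make "shot-level semantic coherence" concrete, here is a minimal sketch of one way to keep or drop few-shot examples as whole units under a token budget. It is not ShotKV itself: the abstract does not give the method's scoring rule or its two-phase logic. The function name `select_shots`, the mean-score ranking, and the greedy fill are all assumptions made for illustration.

```python
from typing import List, Tuple


def select_shots(
    shot_spans: List[Tuple[int, int]],
    token_scores: List[float],
    budget: int,
) -> List[int]:
    """Return sorted token indices to keep, choosing whole shots greedily.

    shot_spans: (start, end) token ranges, end exclusive, one per shot.
    token_scores: hypothetical per-token importance (e.g. attention mass).
    budget: maximum number of prompt tokens to retain.
    """

    def mean_score(i: int) -> float:
        start, end = shot_spans[i]
        return sum(token_scores[start:end]) / max(1, end - start)

    # Assumed heuristic: rank shots by mean per-token score, best first.
    ranked = sorted(range(len(shot_spans)), key=mean_score, reverse=True)

    kept: List[int] = []
    used = 0
    for i in ranked:
        start, end = shot_spans[i]
        length = end - start
        # A shot is kept entirely or dropped entirely, never split.
        if used + length <= budget:
            kept.extend(range(start, end))
            used += length
    return sorted(kept)
```

The point of the sketch is the granularity: a token-level top-k selection could retain fragments of several examples, whereas this version only ever retains complete ones.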

Paper Structure

This paper contains 50 sections, 9 equations, 9 figures, 12 tables, 1 algorithm.

Figures (9)

  • Figure 1: KV cache compression methods on long-context and arithmetic benchmarks. (a) The arithmetic benchmark shows more performance degradation than the long-context benchmark. (b) The long-context benchmark shows more sparsity in its attention heatmap.
  • Figure 2: Attention heatmap on different tasks.
  • Figure 3: Cumulative attention score distribution for Long-Context and Arithmetic. (a) Overall distribution including initial sink tokens, showing high initial concentration. (b) Distribution without sink tokens (first 4 tokens removed), revealing that Arithmetic's non-sink attention is more diffuse compared to Long-Context's.
  • Figure 4: Sensitivity Analysis of Different Benchmark Categories to KV Cache Compression. The performance delta lines are calculated by the paper's performance-change equation.
  • Figure 5: Performance Comparison of KV Cache Compression Methods on KVFundaBench. Results for R1-AR (f) were obtained using the DeepSeek-R1-Distill-Llama-8B model. ShotKV is our proposed method; details can be found in the paper's ShotKV section.
  • ...and 4 more figures
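
The comparison described for Figure 3 (cumulative attention with and without the first 4 sink tokens) can be sketched as follows. This is an illustrative reconstruction, not the authors' plotting code: the function name `cumulative_attention` is hypothetical, and sorting scores in descending order before accumulating is an assumption about how the curve is built.

```python
from typing import List


def cumulative_attention(scores: List[float], num_sink: int = 0) -> List[float]:
    """Cumulative share of attention mass, largest scores first.

    scores: per-token attention received, in prompt order.
    num_sink: number of leading tokens to drop before normalizing
              (the figure caption removes the first 4).
    """
    rest = scores[num_sink:]
    total = sum(rest)
    if total == 0:
        return []
    curve: List[float] = []
    running = 0.0
    # Assumed ordering: accumulate from the most-attended token down.
    for s in sorted(rest, reverse=True):
        running += s
        curve.append(running / total)
    return curve
```

A curve that reaches a high share within a few tokens indicates concentrated attention; a curve that climbs slowly indicates diffuse attention, which is the property the caption attributes to the arithmetic task once sink tokens are excluded.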