Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning

Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, Wen Xiao

TL;DR

This work introduces HeadKV, a head-level KV cache compression framework that allocates per-head budgets based on joint retrieval and reasoning importance. By estimating head importance with Retrieval Heads and Retrieval-Reasoning (R2) heads and distributing budgets accordingly, HeadKV achieves substantial memory reductions while preserving long-context retrieval and reasoning performance. Empirical results on LongBench and LooGLE across Llama-3-8B-Instruct and Mistral-7B-Instruct show HeadKV, especially HeadKV-R2, can match or exceed full KV performance at low budgets, with meaningful latency and memory benefits. The approach highlights the value of fine-grained, head-level budgeting for scalable, long-context capable LLM deployment.

Abstract

Key-Value (KV) caching is a common technique to enhance the computational efficiency of Large Language Models (LLMs), but its memory overhead grows rapidly with input length. Prior work has shown that not all tokens are equally important for text generation and has proposed layer-level KV cache compression to selectively retain key information. Recognizing the distinct roles of attention heads in generation, we propose HeadKV, a head-level KV cache compression method, and HeadKV-R2, which leverages a novel contextual reasoning ability estimation for compression. Our approach operates at the level of individual heads, estimating their importance for contextual QA tasks that require both retrieval and reasoning capabilities. Extensive experiments across diverse benchmarks (LongBench, LooGLE), model architectures (e.g., Llama-3-8B-Instruct, Mistral-7B-Instruct), and long-context ability tests demonstrate that our head-level KV cache compression significantly outperforms strong baselines, particularly in low-resource settings (KV size = 64 and 128). Notably, our method retains just 1.5% of the KV cache while achieving 97% of the performance of the full KV cache on the contextual question answering benchmark. Code is available at https://github.com/FYYFU/HeadKV
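
To make the memory growth concrete, the following minimal sketch estimates the size of an uncompressed KV cache and the fraction of tokens kept under a fixed budget. The function names are hypothetical. The model dimensions (32 layers, 8 KV heads, head dimension 128, 16-bit values) are an assumption based on the publicly released Llama-3-8B configuration, not something stated in this paper. The 8,683-token input length is the LongBench average quoted in the Figure 3 caption.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens.

    The leading factor 2 accounts for storing one key and one value
    vector per token, per layer, per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def retained_fraction(kept_tokens, input_tokens):
    """Fraction of input tokens whose KV entries are kept after compression."""
    return kept_tokens / input_tokens


# Assumed Llama-3-8B-style dimensions (not taken from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
AVG_INPUT_TOKENS = 8683  # LongBench average from the Figure 3 caption

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)
full = kv_cache_bytes(AVG_INPUT_TOKENS, LAYERS, KV_HEADS, HEAD_DIM)
print(f"per token: {per_token} bytes, full cache: {full / 2**30:.2f} GiB")

for budget in (64, 128):
    share = retained_fraction(budget, AVG_INPUT_TOKENS)
    print(f"KV size {budget}: keeps {share:.1%} of tokens")
```

Under these assumptions the cache grows linearly with input length, and a budget of 64 or 128 tokens keeps roughly 0.7% or 1.5% of an average LongBench input. The second figure is consistent with the 1.5% quoted in the abstract, although the source does not spell out how that number was derived.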

Paper Structure

This paper contains 34 sections, 4 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Our proposed head-level KV cache compression method consists of two steps: (1) Head-Level Importance Score Estimation (upper part): important heads that contribute to the contextual reasoning ability are identified using Needle-in-a-Haystack tests. (2) Head-Level KV Cache Allocation (lower part): KV cache budgets for each head during the prefilling phase are allocated based on the importance score distribution identified in the first step.
  • Figure 2: Comparison of examples for head identification: Needle-in-a-Haystack test example from Wu et al. (2024) for identifying the Retrieval Heads distribution (left), and our proposed Needle-in-a-Haystack test example for identifying the Retrieval-Reasoning Heads distribution (right).
  • Figure 3: Results for different KV cache sizes (64, 128, 256, 512, 1024), showing average accuracy across six datasets from the LongBench benchmark with an average input length of 8,683 tokens. Notably, a KV cache size of 64 retains just 0.7% of the total tokens.
  • Figure 4: Head visualization for Llama-3-8B-Instruct results. The Retrieval Heads distribution is too sparse to differentiate effectively between heads, whereas our Retrieval-Reasoning Heads distribution is denser and supports such differentiation. See the appendix for Mistral-7B-Instruct results.
  • Figure 5: Needle-in-a-Haystack test results on Llama-3-8B-Instruct with KV cache = 128. We build our head-level KV cache method on SnapKV, and our proposed method significantly outperforms all strong baselines. Moreover, our Retrieval-Reasoning Heads distribution maintains and improves long-context retrieval ability. Results on Mistral-7B-Instruct, which are consistent with those on Llama-3-8B-Instruct, can be found in the appendix.
  • ...and 8 more figures
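
The second step described in the Figure 1 caption, allocating per-head KV budgets from an importance score distribution, can be illustrated with a short sketch. This is not the authors' implementation: the function name, the `pool_fraction` parameter, and the specific rule (a fixed floor per head plus a shared pool split in proportion to normalized scores) are assumptions chosen for illustration, and the scores below are made-up values rather than measured head importances.

```python
def allocate_head_budgets(scores, base_budget, pool_fraction=0.5):
    """Split a KV budget across heads in proportion to importance.

    scores        -- nested list [layer][head] of non-negative importance scores
    base_budget   -- average number of KV entries per head
    pool_fraction -- hypothetical knob: share of each head's base budget
                     moved into a global pool and redistributed by score
    """
    flat = [s for layer in scores for s in layer]
    total = sum(flat)
    if total <= 0:
        raise ValueError("at least one head needs a positive score")

    floor = base_budget * (1 - pool_fraction)       # guaranteed per head
    pool = base_budget * pool_fraction * len(flat)  # shared across all heads

    # Each head keeps its floor and receives a score-proportional pool share.
    # int() truncates, so the total can fall slightly below the full budget.
    return [[int(floor + pool * s / total) for s in layer] for layer in scores]


# Toy example: 2 layers x 2 heads with illustrative scores.
toy_scores = [[0.0, 1.0], [1.0, 2.0]]
budgets = allocate_head_budgets(toy_scores, base_budget=128)
print(budgets)
```

With this rule, a head with a zero score still keeps the floor, while higher-scoring heads receive larger caches and the total stays at or just below `base_budget` times the number of heads. The actual token selection within each head's budget (the caption of Figure 5 states the method builds on SnapKV) is outside the scope of this sketch.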