Towards Threshold-Free KV Cache Pruning

Xuanfan Ni, Liyan Xu, Chenyang Lyu, Longyue Wang, Mo Yu, Lemao Liu, Fandong Meng, Jie Zhou, Piji Li

TL;DR

This work tackles the memory cost of the KV cache in autoregressive LLM inference by removing the reliance on input-specific pruning budgets. It introduces a threshold-free objective and a practical two-stage method, ReFreeKV, which first ranks cached tokens by position and then evicts the least significant entries, using an input-insensitive Uni-Metric based on the Frobenius norm of the attention matrix with a universal threshold $T=1\%$. Across 13 diverse datasets and multiple model scales, ReFreeKV achieves near full-cache performance while substantially reducing KV cache usage (average budgets around $63.7\%$–$76.0\%$), and often improves throughput with minimal latency overhead. The approach offers robust, domain-agnostic memory savings for real-world open-domain inputs and can potentially be combined with other memory optimization techniques such as quantization. Limitations include remaining gaps to the true optimal budget on some tasks and the lack of hard performance guarantees, motivating future work on more aggressive yet reliable pruning and broader model support.

Abstract

To reduce memory consumption during LLM inference, prior works have proposed numerous methods that prune the KV cache based on various criteria. While these techniques achieve lossless memory reduction on many datasets, they often rely on an under-emphasized condition: a dataset/domain-specific budget size threshold needs to be pre-determined to achieve the optimal performance. However, such input-specific tuning may be considerably limited in real-world scenarios, as open-domain inputs span diverse domains, lengths and difficulty levels, without clear boundaries for pre-tuning. Thus, the dependence on an input-sensitive threshold can be an inherent limitation that may cause large degradation on arbitrary inputs. In this work, we propose a new objective that lifts the threshold constraints for robust KV pruning, calling for "threshold-free" methods that automatically adjust budget sizes while ensuring full-cache performance. We then propose a novel method, ReFreeKV, as the first solution fulfilling this objective, validated by intensive experiments on 13 datasets of diverse context lengths, task types, and model sizes.

Paper Structure

This paper contains 42 sections, 3 equations, 3 figures, 12 tables, 1 algorithm.

Figures (3)

  • Figure 1: The overall workflow of ReFreeKV, as described in the paper's method section. After prefilling, tokens are initially ranked based on their positions, followed by the eviction of the least significant tokens (per layer), whose halting condition is determined by the norm value of the reduced attention matrix. The KV cache for the remaining tokens is then preserved for subsequent generation.
  • Figure 2: Performance trends of Llama3-8B, Mistral-7B, and Qwen2.5-7B across varying Uni-Metric thresholds. The x-axis represents the threshold percentage (0.1% to 5%), and the y-axis denotes the performance score normalized by the full-cache performance.
  • Figure 3: Performance vs. Efficiency Trade-off. (a) Performance retention across five datasets (solid lines). (b) Computational budget consumption (dashed lines) relative to the dense baseline. The shared legend indicates the datasets. Results show that setting the universal threshold to $1\%$ balances performance and memory budget well: it maintains robust full-cache performance while substantially reducing the KV cache.
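
The workflow in the Figure 1 caption (rank tokens by position, evict the least significant ones, halt based on the norm of the reduced attention matrix) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The exact Uni-Metric is not given in this summary, so the halting rule below (stop once the relative drop in the Frobenius norm would exceed the threshold) is an assumption. The function name `select_kept_positions`, the parameter `eviction_order`, and the default oldest-first ranking are likewise hypothetical.

```python
import numpy as np


def select_kept_positions(attn, threshold=0.01, eviction_order=None):
    """Return the sorted key positions to keep for one layer.

    attn: (num_queries, num_keys) attention weights from prefilling.
    threshold: maximum tolerated relative drop in Frobenius norm
        (assumed reading of the universal threshold T = 1%).
    eviction_order: key positions from least to most significant.
        Defaulting to oldest-first is a hypothetical placeholder for
        the position-based ranking, not the paper's actual rule.
    """
    attn = np.asarray(attn, dtype=np.float64)
    num_keys = attn.shape[1]
    if eviction_order is None:
        eviction_order = np.arange(num_keys)
    eviction_order = np.asarray(eviction_order)

    # Squared Frobenius norm decomposes into per-column contributions.
    col_sq = np.sum(attn ** 2, axis=0)
    full_sq = col_sq.sum()
    if full_sq == 0.0:
        return np.arange(num_keys)  # nothing to measure, keep everything

    # removed_sq[k] = squared norm lost after evicting the first k tokens.
    removed_sq = np.concatenate(([0.0], np.cumsum(col_sq[eviction_order])))
    remaining_sq = np.clip(full_sq - removed_sq, 0.0, None)
    rel_loss = 1.0 - np.sqrt(remaining_sq / full_sq)

    # Evict as many tokens as possible while staying within the threshold.
    num_evicted = int(np.nonzero(rel_loss <= threshold)[0].max())
    return np.sort(eviction_order[num_evicted:])


# Toy example: two queries over four cached tokens.
attn = np.array([
    [0.001, 0.001, 0.500, 0.498],
    [0.000, 0.002, 0.300, 0.698],
])
kept = select_kept_positions(attn)
budget = len(kept) / attn.shape[1]  # fraction of the KV cache retained
```

In this toy input the first two tokens carry almost no attention mass, so they are evicted and the retained budget is 50%. With uniform attention, evicting any token would exceed the 1% tolerance and the full cache is kept. The budget therefore adapts to the input rather than being fixed in advance, which is the behaviour the threshold-free objective calls for.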