Towards Threshold-Free KV Cache Pruning
Xuanfan Ni, Liyan Xu, Chenyang Lyu, Longyue Wang, Mo Yu, Lemao Liu, Fandong Meng, Jie Zhou, Piji Li
TL;DR
This work tackles memory-heavy KV cache management in autoregressive LLM inference by removing the reliance on input-specific pruning budgets. It introduces a threshold-free objective and a practical two-stage method, ReFreeKV, which first ranks cache entries by position and then evicts the least-significant entries using an input-insensitive Uni-Metric based on the Frobenius norm of the attention matrix, with a universal threshold $T=1\%$. Across 13 diverse datasets and multiple model scales, ReFreeKV achieves near full-cache performance while substantially reducing KV cache usage (average budgets around $63.7\%$–$76.0\%$), and often improves throughput with minimal latency overhead. The approach offers robust, domain-agnostic memory savings for real-world open-domain inputs and can potentially be combined with other memory optimization techniques such as quantization. Limitations include remaining gaps to the true optimal budget on some tasks and the lack of hard performance guarantees, motivating future work on more aggressive yet reliable pruning and broader model support.
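The two-stage idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the oldest-first eviction order, and the exact form of the metric (Frobenius norm of the evicted attention columns relative to the full attention matrix) are assumptions made purely for illustration. Only the use of a Frobenius-norm-based metric and the universal threshold $T=1\%$ come from the summary.

```python
import math

def column_energy(attn):
    """Sum of squared attention weights per key position (one value per column)."""
    n_keys = len(attn[0])
    return [sum(row[j] ** 2 for row in attn) for j in range(n_keys)]

def select_kept_positions(attn, threshold=0.01):
    """Return the key positions to keep in the cache.

    attn: list of rows (queries), each a list of weights over key positions.
    threshold: universal bound T on the relative Frobenius-norm loss (1% here).
    """
    energy = column_energy(attn)
    total = sum(energy)
    order = list(range(len(energy)))
    if total == 0.0:
        return order  # nothing to measure against, so keep everything

    # Stage 1 (assumed): rank by position, treating the oldest entries
    # as the first eviction candidates.
    evicted_energy = 0.0
    n_evicted = 0

    # Stage 2 (assumed): evict while the Frobenius norm of the removed
    # columns stays within `threshold` of the full matrix norm.
    for pos in order:
        candidate = evicted_energy + energy[pos]
        if math.sqrt(candidate / total) > threshold:
            break
        evicted_energy = candidate
        n_evicted += 1
    return order[n_evicted:]

# Toy attention matrix: two queries, four key positions.
attn = [
    [0.001, 0.002, 0.497, 0.500],
    [0.002, 0.001, 0.500, 0.497],
]
kept = select_kept_positions(attn)
```

In this toy case the first two positions carry almost no attention mass and are evicted, so the retained budget is chosen by the metric rather than fixed in advance, which is the behavior the threshold-free objective calls for.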
Abstract
To reduce memory consumption during LLM inference, prior works have proposed numerous KV cache pruning methods based on various criteria. While these techniques accomplish lossless memory reduction on many datasets, they often rely on an under-emphasized condition: a dataset- or domain-specific budget size threshold must be pre-determined to achieve optimal performance. However, such input-specific tuning may be considerably limited in real-world scenarios, as open-domain inputs span diverse domains, lengths, and difficulty levels, without clear boundaries for pre-tuning. Thus, the dependence on an input-sensitive threshold can be an inherent limitation that may cause large degradation on arbitrary inputs. In this work, we propose a new objective that lifts the threshold constraint for robust KV pruning, calling for "threshold-free" methods that automatically adjust budget sizes while ensuring full-cache performance. We then propose a novel method, ReFreeKV, as the first solution fulfilling this objective, validated by extensive experiments on 13 datasets spanning diverse context lengths and task types, across multiple model sizes.
