More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression
Jiebin Zhang, Dawei Zhu, Yifan Song, Wenhao Wu, Chuqiao Kuang, Xiaoguang Li, Lifeng Shang, Qun Liu, Sujian Li
TL;DR
This work investigates the memory bottleneck of KV caches in long-context LLMs and introduces quantized pruning, a strategy that stores more tokens at lower precision to optimize the token-precision trade-off under a fixed memory budget. By combining KV pruning with KV quantization, the authors demonstrate that quantized pruning consistently outperforms standalone pruning or quantization across tasks, input lengths, and model scales, with notable gains in retrieval-heavy tasks. The findings hold across pruning methods, quantization strategies, and models, and show that storing more tokens at 4-bit precision can yield better long-context performance than storing fewer tokens at higher precision. The work provides practical guidance for KV cache compression and suggests directions for future speedups and exploration of additional compression dimensions.
Abstract
As large language models (LLMs) process increasingly long context windows, the memory usage of the KV cache has become a critical bottleneck during inference. The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either the token dimension or the precision dimension separately. However, these works leave the trade-off between these two orthogonal dimensions largely under-explored. In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression. Experiments demonstrate that storing more tokens in the KV cache with lower precision, a strategy we term quantized pruning, can significantly enhance the long-context performance of LLMs. In-depth analysis of the token-precision trade-off across key aspects demonstrates that quantized pruning achieves substantial improvements in retrieval-related tasks and consistently performs well across varying input lengths. Furthermore, quantized pruning demonstrates notable stability and effectiveness across different KV pruning methods, quantization strategies, and model scales. These findings offer valuable insights into optimizing KV cache compression through balanced token-precision trade-off strategies. Our code is available at https://github.com/zhzihao/QPruningKV.
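To make the token-precision trade-off concrete, the following minimal sketch computes how many tokens fit in a fixed KV cache budget at different bit widths. It is an illustration only, not the authors' implementation: the function names and the model shape (32 layers, 8 KV heads, head dimension 128) are hypothetical, and the calculation ignores the extra memory that quantization scales and zero points would require.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, bits: int) -> int:
    """Bytes needed to cache one token's keys and values at the given bit width.

    Assumption: quantization metadata (scales, zero points) is ignored.
    """
    # Each layer stores one key vector and one value vector per KV head.
    elements = 2 * num_layers * num_kv_heads * head_dim
    return elements * bits // 8


def tokens_within_budget(budget_bytes: int, num_layers: int, num_kv_heads: int,
                         head_dim: int, bits: int) -> int:
    """Number of whole tokens that fit in the budget at the given precision."""
    return budget_bytes // kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bits)


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

# Fix the budget to what 512 tokens cost at 16-bit precision.
BUDGET = 512 * kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bits=16)

for bits in (16, 8, 4, 2):
    n = tokens_within_budget(BUDGET, LAYERS, KV_HEADS, HEAD_DIM, bits)
    print(f"{bits:>2}-bit: {n} tokens")
```

Under these assumptions, the budget that holds 512 tokens at 16 bits holds four times as many at 4 bits. Whether the extra tokens outweigh the precision loss is the empirical question the paper studies.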
