More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression

Jiebin Zhang, Dawei Zhu, Yifan Song, Wenhao Wu, Chuqiao Kuang, Xiaoguang Li, Lifeng Shang, Qun Liu, Sujian Li

TL;DR

This work investigates the memory bottleneck of KV caches in long-context LLMs and introduces quantized pruning, a strategy that stores more tokens at lower precision to optimize the token-precision trade-off under a fixed memory budget. By combining KV pruning with KV quantization, the authors show that quantized pruning consistently outperforms standalone pruning or quantization across tasks, input lengths, and model scales, with notable gains on retrieval-heavy tasks. The findings hold across pruning methods, quantization strategies, and models, and show that keeping more tokens at 4-bit precision can yield better long-context performance than keeping fewer tokens at higher precision. The work provides practical guidance for KV cache compression and suggests directions for future speedups and for exploring additional compression dimensions.

Abstract

As large language models (LLMs) process increasingly long context windows, the memory usage of the KV cache has become a critical bottleneck during inference. Mainstream KV compression methods, including KV pruning and KV quantization, focus on either the token dimension or the precision dimension separately, leaving the trade-off between these two orthogonal dimensions largely under-explored. In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression. Experiments demonstrate that storing more tokens in the KV cache at lower precision, a strategy we term quantized pruning, can significantly enhance the long-context performance of LLMs. In-depth analysis of the token-precision trade-off across key aspects demonstrates that quantized pruning achieves substantial improvements in retrieval-related tasks and consistently performs well across varying input lengths. Furthermore, quantized pruning demonstrates notable stability and effectiveness across different KV pruning methods, quantization strategies, and model scales. These findings offer valuable insights into optimizing KV cache compression through balanced token-precision trade-off strategies. Our code is available at https://github.com/zhzihao/QPruningKV.
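
The trade-off described above rests on simple arithmetic: under a fixed memory budget, halving the bit width of each cached element leaves room for twice as many tokens. The sketch below illustrates this. It is not taken from the authors' code: the function names are hypothetical, the default shape (32 layers, 8 KV heads, head dimension 128) is an assumed Llama-3-8B-like configuration, and the storage overhead of quantization scales and zero points is ignored.

```python
def kv_cache_bytes(num_tokens, bits, num_layers=32, num_kv_heads=8, head_dim=128):
    """Approximate KV cache size in bytes.

    Assumption: quantization metadata (scales, zero points) is ignored,
    so real quantized caches are somewhat larger than this estimate.
    """
    # The factor 2 accounts for storing both keys and values.
    elements = 2 * num_layers * num_kv_heads * head_dim * num_tokens
    return elements * bits // 8


def tokens_for_budget(base_tokens, base_bits, target_bits):
    """Tokens that fit in the memory of `base_tokens` stored at `base_bits`."""
    return base_tokens * base_bits // target_bits


base_tokens = 512  # hypothetical pruning budget at 16-bit precision
configs = [(tokens_for_budget(base_tokens, 16, bits), bits) for bits in (16, 8, 4)]
sizes = [kv_cache_bytes(tokens, bits) for tokens, bits in configs]

for (tokens, bits), size in zip(configs, sizes):
    print(f"{tokens:5d} tokens at {bits:2d}-bit: {size / 2**20:.1f} MiB")
```

Under these assumptions the three settings (1x tokens at 16-bit, 2x at 8-bit, 4x at 4-bit) occupy the same number of bytes, which is what makes them comparable at an approximately equal memory budget.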

Paper Structure

This paper contains 38 sections, 6 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: The token-precision trade-off under varying memory budgets on LongBench and NIAH. We report the results of SnapKV-based and PyramidKV-based quantized pruning on Llama-3 and Mistral-v0.2. We compare three configurations with approximately equivalent memory usage: 1) Using standalone KV pruning to retain $1\times$ tokens in 16-bit precision. 2) Quantized pruning by retaining $2\times$ tokens in 8-bit precision. 3) Quantized pruning by retaining $4\times$ tokens in 4-bit precision. Quantized pruning, which preserves more tokens at a lower precision, consistently outperforms standalone KV pruning methods across various budgets.
  • Figure 2: The token-precision trade-off across different input lengths. We report the results on LongBench and three subsets of RULER. We use PyramidKV-based quantized pruning.
  • Figure 3: Scaling effect on Llama family models, with PyramidKV-based quantized pruning. All models are under a 1/64 KV cache budget.
  • Figure 4: Ablation of quantization strategies on quantized pruning, retaining 512 KV tokens in 4-bit.
  • Figure 5: The results of layer-wise quantized pruning on Llama-3-8B-Instruct, with SnapKV as the pruning method. We use $4\times$ KV tokens in 4-bit as the baseline and report the relative change. Configurations are modified every 4 layers for the initial and final layers, while intermediate layers are reconfigured every 8 layers.
  • ...and 4 more figures
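
The quantized-pruning configurations compared in Figure 1 combine two steps: select which tokens to keep, then store the kept entries at lower precision. The toy sketch below shows that pipeline in pure Python. It is an illustration under stated assumptions, not the authors' implementation: the importance scores are placeholder numbers rather than SnapKV or PyramidKV scores, the quantizer is a plain per-vector asymmetric min-max scheme chosen for simplicity, and all function names are hypothetical.

```python
def prune_top_k(vectors, scores, k):
    """Keep the k vectors with the highest score, preserving original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])
    return keep, [vectors[i] for i in keep]


def quantize(vec, bits):
    """Asymmetric min-max uniform quantization of one vector to integer codes."""
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Clamp to guard against floating-point drift at the range boundaries.
    codes = [min(levels, max(0, round((x - lo) / scale))) for x in vec]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale + lo for c in codes]


# Placeholder cache: four token vectors with made-up importance scores.
vectors = [
    [0.0, 1.0, 2.0, 3.0],
    [4.0, -4.0, 0.5, 1.5],
    [1.0, 1.0, 1.0, 1.0],
    [-1.0, 0.0, 1.0, 2.0],
]
scores = [0.1, 0.9, 0.3, 0.7]

keep, kept_vectors = prune_top_k(vectors, scores, k=2)
compressed = [quantize(v, bits=4) for v in kept_vectors]
restored = [dequantize(codes, scale, lo) for codes, scale, lo in compressed]
```

With this quantizer, each restored value differs from the original by at most half a quantization step, so lowering the bit width trades a bounded per-element error for room to keep more tokens.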