NQKV: A KV Cache Quantization Scheme Based on Normal Distribution Characteristics
Zhihang Cai, Xingjun Zhang, Zhendong Tan, Zheng Wei
TL;DR
This work tackles the KV cache memory bottleneck in large-language-model inference by introducing NQKV, a per-block KV cache quantization scheme that leverages the normal distribution characteristics observed within KV cache blocks. By quantizing KV cache entries to four-bit Normal Float (NF4) indices and storing them with block-wise quantization, NQKV achieves substantial memory savings while preserving model accuracy, enabling twice the batch size or four times longer context with near-linear throughput gains. The method is orthogonal to existing weight- and activation-quantization approaches and system-level memory optimizations, making it compatible with a range of quantization strategies. Empirical results on OPT models demonstrate negligible accuracy loss across zero-shot tasks, along with up to 9.3x throughput improvements and 60–80% additional memory savings relative to baseline memory-saving methods under large-scale workloads.
Abstract
Large Language Models (LLMs) have demonstrated remarkable proficiency across a wide range of tasks. However, LLMs often require larger batch sizes to enhance throughput or longer context lengths to meet task demands. Both significantly increase the memory consumed by the Key-Value (KV) cache during inference, which becomes a major bottleneck in LLM deployment. Quantization is a common and straightforward way to address this issue. Current quantization methods for activations are limited to 8 bits, and quantizing to fewer bits can cause substantial accuracy drops. To save further space by quantizing the KV cache to fewer bits, we analyze the element distribution of the KV cache and design the NQKV algorithm. Since the elements within each block of the KV cache follow a normal distribution, NQKV employs per-block quantile quantization to achieve information-theoretically optimal quantization error. Without significantly compromising model output quality, NQKV enables the OPT model to perform inference with a 2x larger batch size or a 4x longer context length, and it improves throughput by 9.3x compared with inference that does not use the KV cache.
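The idea of per-block quantile quantization can be illustrated with a minimal sketch. It is not the authors' implementation: the codebook here is built from evenly spaced quantiles of a standard normal distribution and is only an approximation of a 4-bit normal-float table, and the function names (`normal_codebook`, `quantize_block`, `dequantize_block`), the block size of 64, and the synthetic Gaussian data are all assumptions made for illustration.

```python
import bisect
import random
from statistics import NormalDist


def normal_codebook(bits=4):
    """Build 2**bits levels at evenly spaced quantiles of N(0, 1), scaled to [-1, 1].

    Illustrative only: this approximates, but is not, the exact NF4 table.
    """
    n = 2 ** bits
    nd = NormalDist()
    # Midpoint probabilities avoid the infinite quantiles at p = 0 and p = 1.
    quantiles = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    top = max(abs(v) for v in quantiles)
    return [v / top for v in quantiles]


def quantize_block(block, codebook):
    """Map one block to (indices, absmax) using nearest-level lookup."""
    # The per-block scale is the absolute maximum; fall back to 1.0 for an all-zero block.
    absmax = max(abs(v) for v in block) or 1.0
    # Decision boundaries sit halfway between adjacent codebook levels.
    boundaries = [(a + b) / 2 for a, b in zip(codebook, codebook[1:])]
    indices = [bisect.bisect_left(boundaries, v / absmax) for v in block]
    return indices, absmax


def dequantize_block(indices, absmax, codebook):
    """Recover approximate values from 4-bit indices and the block scale."""
    return [codebook[i] * absmax for i in indices]


# Hypothetical usage on synthetic, normally distributed "cache" values.
random.seed(0)
codebook = normal_codebook()
block = [random.gauss(0.0, 0.02) for _ in range(64)]  # assumed block size of 64
indices, absmax = quantize_block(block, codebook)
restored = dequantize_block(indices, absmax, codebook)
max_err = max(abs(a - b) for a, b in zip(block, restored))
```

Each block stores one scale (`absmax`) plus one 4-bit index per element, so the per-element error is bounded by half the widest gap between codebook levels times the block scale. Placing levels at normal quantiles puts more levels where normally distributed values are dense, which is the property the abstract relies on.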
