NQKV: A KV Cache Quantization Scheme Based on Normal Distribution Characteristics

Zhihang Cai, Xingjun Zhang, Zhendong Tan, Zheng Wei

TL;DR

This work tackles the KV cache memory bottleneck in large-language-model inference by introducing NQKV, a per-block KV cache quantization scheme that leverages the normal distribution characteristics observed within KV cache blocks. By quantizing KV cache entries to four-bit Normal Float (NF4) indices and storing them with block-wise quantization, NQKV achieves substantial memory savings while preserving model accuracy, enabling twice the batch size or four times longer context with near-linear throughput gains. The method is orthogonal to existing weight- and activation-quantization approaches and system-level memory optimizations, making it compatible with a range of quantization strategies. Empirical results on OPT models demonstrate negligible accuracy loss across zero-shot tasks, along with up to 9.3x throughput improvements and 60–80% additional memory savings relative to baseline memory-saving methods under large-scale workloads.

Abstract

Large Language Models (LLMs) have demonstrated remarkable proficiency across a wide range of tasks. However, LLMs often require larger batch sizes to enhance throughput or longer context lengths to meet task demands, which significantly increases the memory consumption of the Key-Value (KV) cache during inference and makes it a major bottleneck in LLM deployment. To address this issue, quantization is a common and straightforward approach. Currently, quantization methods for activations are limited to 8-bit, and quantization to even lower bits can lead to substantial accuracy drops. To further save space by quantizing the KV cache to even lower bits, we analyzed the element distribution of the KV cache and designed the NQKV algorithm. Since the elements within each block of the KV cache follow a normal distribution, NQKV employs per-block quantile quantization to achieve information-theoretically optimal quantization error. Without significantly compromising model output quality, NQKV enables the OPT model to perform inference with a 2x larger batch size or a 4x longer context length, and it improves throughput by 9.3x compared to when the KV cache is not used.
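To make the memory argument concrete, the following is a rough back-of-the-envelope sketch, not a calculation taken from the paper. It assumes the usual estimate of KV cache size (keys plus values, one hidden-size vector per token per layer), a 32-layer configuration for OPT-6.7B (from the public model configuration, not from this text), and one FP16 scaling constant stored per 256-element block. The function name `kv_cache_bytes` and the batch and sequence values are illustrative.

```python
def kv_cache_bytes(layers, hidden, batch, seq_len, bits_per_elem):
    """Estimate KV cache size in bytes: keys and values for every layer and token."""
    elements = 2 * layers * batch * seq_len * hidden  # factor 2 = keys + values
    return elements * bits_per_elem / 8


# Assumed OPT-6.7B shape: 32 layers (assumption), hidden size 4096 (stated in Figure 4).
LAYERS, HIDDEN = 32, 4096
BATCH, SEQ_LEN = 16, 2048  # illustrative workload, not from the paper

# FP16 baseline: 16 bits per element.
fp16_bytes = kv_cache_bytes(LAYERS, HIDDEN, BATCH, SEQ_LEN, 16)

# 4-bit indices plus one assumed FP16 scale per 256-element block.
BLOCK_SIZE = 256
nf4_bits = 4 + 16 / BLOCK_SIZE
nf4_bytes = kv_cache_bytes(LAYERS, HIDDEN, BATCH, SEQ_LEN, nf4_bits)

ratio = fp16_bytes / nf4_bytes
print(f"FP16 KV cache: {fp16_bytes / 2**30:.2f} GiB")
print(f"4-bit KV cache: {nf4_bytes / 2**30:.2f} GiB")
print(f"compression ratio: {ratio:.2f}x")
```

Under these assumptions the per-block scale adds only a small overhead, so the cache shrinks by slightly less than 4x, which is consistent in spirit with the 4x longer context length quoted above.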

Paper Structure

This paper contains 17 sections, 12 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: The memory consumption of OPT models at different scales under various batch size and sequence length configurations.
  • Figure 2: The memory usage percentages of different components during inference for the OPT-175B model. As the batch size and sequence length increase, the memory space allocated to the KV cache significantly increases.
  • Figure 3: Demonstration of the data distribution of randomly selected tokens in OPT-6.7B decoder layers. Even if the data within each token follows a normal distribution, their standard deviations may differ. Therefore, we standardized the data to make their standard deviations equal to 1, allowing for easy comparison with the standard normal distribution. For ease of observation, we also plotted the probability density function curve of the standard normal distribution in the figure.
  • Figure 4: Quantile-Quantile plots of data distribution in tokens and blocks of the OPT-6.7B model. The hidden states size of OPT-6.7B is 4096. With a block size of 256, we can obtain 16 blocks. For the sake of demonstration, only the Quantile-Quantile plots of three of these blocks are shown here. The identity line $y=x$ represents the Q-Q plot of a standard normal distribution, while other data points are plotted based on the distribution of the data. If the data points approximately lie on the line $y=x$, it indicates that the two distributions being compared are similar, that is, the data follows a normal distribution.
  • Figure 5: Block-wise quantile quantization. For demonstration purposes, assume the hidden states size is 24, the number of input tokens is 1024, the block size is 6, and the dimensions of the keys matrix are 1024×24 (ignoring batch size). Each token of the keys can therefore be divided into 4 blocks, and a keys matrix has 1024×4 blocks. We quantize each block separately, obtaining NF4 indices that are stored in the KV cache. During the dequantization process, the NF4 indices stored in the KV cache are used to look up the index table and retrieve the corresponding values, which are then restored to the FP16 data type for computation.
  • ...and 4 more figures
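The block-wise scheme described in the Figure 5 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the 16-level codebook is built from standard-normal quantiles as a stand-in for the real NF4 table (same idea, different values), each block is normalized by its absolute maximum (an assumption; the caption does not say how blocks are scaled), and all function names are hypothetical. The toy shapes follow the caption: hidden size 24 and block size 6, giving 4 blocks per token.

```python
import random
from statistics import NormalDist


def make_codebook(bits=4):
    """Build 2**bits levels from standard-normal quantiles, scaled into [-1, 1].

    Stand-in for the NF4 lookup table, not the actual NF4 values.
    """
    n = 2 ** bits
    normal = NormalDist()
    # Midpoint probabilities avoid the infinite quantiles at p=0 and p=1.
    levels = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(v) for v in levels)
    return [v / largest for v in levels]


def quantize_block(block, codebook):
    """Return (indices, absmax): one 4-bit index per element plus the block scale."""
    absmax = max(abs(x) for x in block) or 1.0  # guard against an all-zero block
    indices = []
    for x in block:
        scaled = x / absmax  # now in [-1, 1]
        nearest = min(range(len(codebook)), key=lambda i: abs(codebook[i] - scaled))
        indices.append(nearest)
    return indices, absmax


def dequantize_block(indices, absmax, codebook):
    """Look up each index in the codebook and undo the block scaling."""
    return [codebook[i] * absmax for i in indices]


def quantize_token(token, block_size, codebook):
    """Split one token vector into blocks and quantize each block separately."""
    if len(token) % block_size != 0:
        raise ValueError("hidden size must be a multiple of the block size")
    return [
        quantize_block(token[start:start + block_size], codebook)
        for start in range(0, len(token), block_size)
    ]


# Toy shapes from the Figure 5 caption: hidden size 24, block size 6.
rng = random.Random(0)
token = [rng.gauss(0.0, 1.0) for _ in range(24)]
codebook = make_codebook()
blocks = quantize_token(token, 6, codebook)
restored = [
    value
    for indices, absmax in blocks
    for value in dequantize_block(indices, absmax, codebook)
]
print(len(blocks), "blocks per token")
print("max abs error:", max(abs(a - b) for a, b in zip(token, restored)))
```

Only the indices and one scale per block would need to be kept in the cache; the codebook is shared. A real implementation would pack two 4-bit indices per byte and run on the GPU, which this sketch leaves out.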