Accurate Block Quantization in LLMs with Outliers

Nikita Trukhanov, Ilya Soloveychik

TL;DR

This work tackles the KV-cache memory bottleneck in autoregressive LLM inference by enabling accurate low-precision quantization with Block Floating Point (BFP) formats in the presence of outliers. It introduces the K-sort algorithm, which sorts the rows of the K projection ${\boldsymbol W}_{\boldsymbol k}$ by their norms and reorders the rows of the corresponding Q projection ${\boldsymbol W}_{\boldsymbol q}$ in the same way. The reordering is fixed at compile time and preserves the query-key inner products exactly, even when rotary position embeddings (RoPE) are used. Empirical evaluation on Llama2-7B-hf with wikitext-2 shows that block size $n=128$ yields no gain, while $n=64$ and $n=32$ achieve quantization gains, with roughly a 2× reduction in K-cache memory and minimal accuracy loss. The approach relies on BFP formats (e.g., BFP12 with 4-bit mantissas and 8-bit shared exponents) and is designed to add no inference latency, offering a practical path to longer sequences and tighter hardware integration. Overall, the paper demonstrates that a simple compile-time channel rearrangement can substantially improve the quantization quality of K-vectors in LLM inference.
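
The channel rearrangement summarized above can be illustrated with a minimal NumPy sketch. It assumes the convention $k = W_k x$ and $q = W_q x$, so that rows of the projections correspond to channels. It sorts in ascending order (the direction is an assumption) and omits RoPE and multi-head structure entirely. The function name `k_sort` and all shapes are illustrative and not taken from the paper.

```python
import numpy as np

def k_sort(W_q, W_k):
    """Sort rows of W_k by Euclidean norm and permute rows of W_q identically."""
    perm = np.argsort(np.linalg.norm(W_k, axis=1))  # ascending row norms
    return W_q[perm], W_k[perm], perm

rng = np.random.default_rng(0)
d_model, d_head = 16, 8                      # toy sizes, chosen arbitrarily
W_q = rng.standard_normal((d_head, d_model))
W_k = rng.standard_normal((d_head, d_model))
W_k[[1, 5]] *= 20.0                          # two artificial outlier channels

x_q = rng.standard_normal(d_model)           # hidden state of the query token
x_k = rng.standard_normal(d_model)           # hidden state of the key token

# The permutation is computed once from the weights, before any inference.
W_q_s, W_k_s, perm = k_sort(W_q, W_k)

score_before = (W_q @ x_q) @ (W_k @ x_k)
score_after = (W_q_s @ x_q) @ (W_k_s @ x_k)
print(perm, score_before, score_after)
```

Because the same permutation is applied to both projections, the dot product sums the same products in a different order, so the attention score is unchanged up to floating-point rounding.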

Abstract

The demand for inference on extremely large-scale LLMs has grown enormously in recent months. This has made evident the colossal shortage of dedicated hardware capable of efficient and fast processing of the compute and memory movement involved. The problem is aggravated by the explosive rise in the lengths of the sequences being processed, since those require efficient on-chip storage of a KV-cache whose size is proportional to the sequence length. To make the required compute feasible and fit the data into the available memory, numerous quantization techniques have been proposed that allow accurate quantization of both weights and activations. One of the main recent breakthroughs in this direction was the introduction of the family of Block Floating Point (BFP) formats, characterized by a block of mantissas with a shared scale factor. These enable memory-, power-, and compute-efficient hardware support of tensor operations and provide extremely good quantization accuracy. The main issue preventing widespread application of block formats is the presence of outliers in weights and activations, since those affect the accuracy of the other values in the same block. In this paper, we focus on the most critical problem of limited KV-cache storage. We propose a novel approach enabling the use of low-precision BFP formats without compromising the resulting model accuracy. We exploit the common channel-wise patterns exhibited by the outliers to rearrange them in such a way that their quantization quality is significantly improved. The methodology yields 2x savings in memory footprint without significant degradation of the model's accuracy. Importantly, the rearrangement of channels happens at compile time and thus has no impact on inference latency.
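
To make the outlier problem concrete, here is a toy BFP encoder, written as a sketch rather than a faithful model of any hardware format. Its assumptions are: a power-of-two scale per block derived from the block's largest magnitude, one mantissa bit spent on the sign, and round-to-nearest with saturation. The names `bfp_encode` and `bfp_decode` are hypothetical.

```python
import numpy as np

def bfp_encode(x, block_size, mantissa_bits):
    """Split x into blocks; return integer mantissas and one exponent per block."""
    blocks = np.asarray(x, dtype=np.float64).reshape(-1, block_size)
    # frexp writes max = m * 2**exp with 0.5 <= m < 1, so every |value| < 2**exp.
    _, exp = np.frexp(np.abs(blocks).max(axis=1, keepdims=True))
    # Assumption of this sketch: one mantissa bit is used for the sign.
    step = np.ldexp(1.0, exp - (mantissa_bits - 1))
    q_max = 2 ** (mantissa_bits - 1) - 1
    mant = np.clip(np.round(blocks / step), -q_max, q_max).astype(np.int8)
    return mant, exp

def bfp_decode(mant, exp, mantissa_bits):
    """Reconstruct approximate values from mantissas and shared exponents."""
    step = np.ldexp(1.0, exp - (mantissa_bits - 1))
    return (mant * step).ravel()

clean = [0.5, -0.25, 0.75, 0.6]
with_outlier = [0.5, -0.25, 0.75, 64.0]
for block in (clean, with_outlier):
    mant, exp = bfp_encode(block, block_size=4, mantissa_bits=4)
    print(block, "->", bfp_decode(mant, exp, mantissa_bits=4))
```

In the second block the outlier 64.0 forces a step of 16, so the three small entries all round to zero. In the first block the step is 0.125 and the small entries survive with little error.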

Paper Structure

This paper contains 9 sections, 5 equations, 1 figure, 1 table, and 1 algorithm.

Figures (1)

  • Figure 1: Left: the original ${\boldsymbol W}_{\boldsymbol k}$ and ${\boldsymbol k}$. Right: the rows of ${\boldsymbol W}_{\boldsymbol k}$ sorted by their Euclidean norms, yielding $\pi({\boldsymbol W}_{\boldsymbol k})$ and the resulting $\pi({\boldsymbol k}^\top)$. Colors reflect the absolute values of the elements, from smaller (green) to larger (red). BFP quantization of $\pi({\boldsymbol k}^\top)$ is more accurate than that of ${\boldsymbol k}^\top$ because the entries of the former that end up in the same block are closer in absolute value.
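
The effect described in the caption can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions, not the paper's experiment: the weights are random, one large-norm row is planted in every block of channels, the quantizer is a toy 4-bit BFP round trip with one mantissa bit for the sign, and quality is measured by mean relative error per entry (a metric chosen here for illustration). The helper names are hypothetical.

```python
import numpy as np

def bfp_roundtrip(K, block_size, mantissa_bits=4):
    """Quantize and dequantize K with one power-of-two scale per block."""
    blocks = K.reshape(-1, block_size)       # consecutive channels share a block
    _, exp = np.frexp(np.abs(blocks).max(axis=1, keepdims=True))
    step = np.ldexp(1.0, exp - (mantissa_bits - 1))
    q_max = 2 ** (mantissa_bits - 1) - 1
    mant = np.clip(np.round(blocks / step), -q_max, q_max)
    return (mant * step).reshape(K.shape)

def mean_rel_err(K, block_size):
    """Average of |quantized - exact| / |exact| over all entries."""
    return np.mean(np.abs(bfp_roundtrip(K, block_size) - K) / np.abs(K))

rng = np.random.default_rng(0)
d_model, d_head, block_size = 64, 32, 8      # toy sizes, chosen arbitrarily
W_k = rng.standard_normal((d_head, d_model))
W_k[::block_size] *= 32.0                    # one large-norm row per block

X = rng.standard_normal((256, d_model))      # synthetic hidden states
perm = np.argsort(np.linalg.norm(W_k, axis=1))

K = X @ W_k.T                                # original channel order
K_sorted = X @ W_k[perm].T                   # channels ordered by row norm

err_orig = mean_rel_err(K, block_size)
err_sorted = mean_rel_err(K_sorted, block_size)
print(f"original order: {err_orig:.3f}, sorted order: {err_sorted:.3f}")
```

In the original order every block contains one large channel that sets the shared scale, so most small entries round to zero. After sorting, the large channels share a single block and the remaining blocks contain entries of similar magnitude.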