Table of Contents
Fetching ...

PrefixQuant: Eliminating Outliers by Prefixed Tokens for Large Language Models Quantization

Mengzhao Chen, Yi Liu, Jiahao Wang, Yi Bin, Wenqi Shao, Ping Luo

TL;DR

PrefixQuant tackles token-wise outliers in LLM activations by prefixing high-frequency outlier tokens in the KV cache, confinement of outliers to prefixed positions, and offline prefilling, enabling efficient, training-free outlier handling plus block-wise fine-tuning with trainable quantizers. The approach yields state-of-the-art or competitive accuracy across dynamic and static quantization settings (W4A4KV4, W4A8KV4) and multiple model families, while delivering substantial end-to-end speedups in prefilling and decoding. Through extensive ablations and comparisons, PrefixQuant demonstrates the effectiveness of prefixed tokens, content-aware prefixed selection, and per-block optimization in reducing quantization error and maintaining performance. The method is compatible with existing quantization schemes and shows broad practical impact for deploying LLMs with lower precision on real hardware.

Abstract

Existing weight-activation quantization methods for Large Language Models (LLMs) primarily address channel-wise outliers but often neglect token-wise outliers, which limits the accuracy of quantized models. In this work, we propose PrefixQuant, a novel quantization method that achieves state-of-the-art performance across various precision levels (W4A4KV4 and W4A8KV4) and granularities (dynamic and static quantization) by effectively isolating token-wise outliers. First, PrefixQuant eliminates token-wise outliers by prefixing outlier tokens in the KV cache, a process that is training-free and highly efficient (e.g., 1 minutes for Llama-3-70B). Second, PrefixQuant introduces new trainable parameters for block-wise training to compensate for quantization error. Our experiments show that PrefixQuant significantly outperforms existing dynamic quantization methods, even under coarser static quantization settings. For instance, PrefixQuant achieves an average accuracy improvement of +3.08 and +2.85 points over SpinQuant (dynamic quantization) on five zero-shot reasoning tasks under dynamic and static quantization settings, respectively, on W4A4KV4 Llama-3-8B. Additionally, we demonstrate up to 2.74x prefilling speedup and 2.16x decoding speedup for LLMs using W4A4 PrefixQuant. Our code is available at https://github.com/ChenMnZ/PrefixQuant.

PrefixQuant: Eliminating Outliers by Prefixed Tokens for Large Language Models Quantization

TL;DR

PrefixQuant tackles token-wise outliers in LLM activations by prefixing high-frequency outlier tokens in the KV cache, confinement of outliers to prefixed positions, and offline prefilling, enabling efficient, training-free outlier handling plus block-wise fine-tuning with trainable quantizers. The approach yields state-of-the-art or competitive accuracy across dynamic and static quantization settings (W4A4KV4, W4A8KV4) and multiple model families, while delivering substantial end-to-end speedups in prefilling and decoding. Through extensive ablations and comparisons, PrefixQuant demonstrates the effectiveness of prefixed tokens, content-aware prefixed selection, and per-block optimization in reducing quantization error and maintaining performance. The method is compatible with existing quantization schemes and shows broad practical impact for deploying LLMs with lower precision on real hardware.

Abstract

Existing weight-activation quantization methods for Large Language Models (LLMs) primarily address channel-wise outliers but often neglect token-wise outliers, which limits the accuracy of quantized models. In this work, we propose PrefixQuant, a novel quantization method that achieves state-of-the-art performance across various precision levels (W4A4KV4 and W4A8KV4) and granularities (dynamic and static quantization) by effectively isolating token-wise outliers. First, PrefixQuant eliminates token-wise outliers by prefixing outlier tokens in the KV cache, a process that is training-free and highly efficient (e.g., 1 minutes for Llama-3-70B). Second, PrefixQuant introduces new trainable parameters for block-wise training to compensate for quantization error. Our experiments show that PrefixQuant significantly outperforms existing dynamic quantization methods, even under coarser static quantization settings. For instance, PrefixQuant achieves an average accuracy improvement of +3.08 and +2.85 points over SpinQuant (dynamic quantization) on five zero-shot reasoning tasks under dynamic and static quantization settings, respectively, on W4A4KV4 Llama-3-8B. Additionally, we demonstrate up to 2.74x prefilling speedup and 2.16x decoding speedup for LLMs using W4A4 PrefixQuant. Our code is available at https://github.com/ChenMnZ/PrefixQuant.
Paper Structure (27 sections, 3 equations, 20 figures, 15 tables)

This paper contains 27 sections, 3 equations, 20 figures, 15 tables.

Figures (20)

  • Figure 1: 4-bit per-token dynamic quantization error in 2048 input context length. Two outlier tokens account for 94.7% of quantization error , while the remaining 2046 tokens contribute only 5.4%. Quantization error is measured in the output of Llama-2-7B 2-nd transformer block through mean square error (MSE).
  • Figure 2: Comparison of proposed PrefixQuant with existing methods. This figure shows the intermediate input activation of the 2-nd down_proj linear layer in Llama-2-7B using different methods. Quantization error is measured in the output of Llama-2-7B 2-nd transformer block through mean square error with 4-bit per-token dynamic quantization. The original distribution has significant outliers larger than 1,500 (left), leading 54.63 quantization error. The previous method with Hadamard rotation quarot reduces outliers to nearly 15 (middle) but still suffers from 7.88 quantization error. We propose PrefixQuant (right), which prefixes some specific tokens in KV cache to isolate outliers, reducing the maximum to nearly 0.07, significantly improving quantization error to 0.04.
  • Figure 3: Example of token-wise outliers. We present (I)(II) upper outliers and (III) lower outliers. Top-1, Medium, Min-1 indicate the largest, median, and smallest values among token-wise maximum values, respectively. We also calculate the the ratios of $\frac{\text{Top-1}}{\text{Median}}$ and $\frac{\text{Median}}{\text{Min-1}}$ in each layer, and report the maximum ratio across all layers . A lower ratio indicates a more uniform distribution. we take Llama-2-7B as an example here, more visualizatiosn about other models can be find in Sec.\ref{['sec:more_visualization']}.
  • Figure 4: Explorations of outlier tokens in Llama-2-7B. (a) Outlier token only exits in nearly 2 positions in the overall input sequence. (b) Excluding token in position 0, outlier tokens only exits in '." or "\\ n" tokens. (c) Outlier tokens consistently occur in the starting token (position 0) and another front but un-predictable position index. (d) Prefixing the input sequence with high-frequency outlier tokens (".\\ n") can constraint the outlier tokens only exit in position 0 and 1.
  • Figure 5: Prefixed tokens in KV cache across different models. [BOS] indicates the special token for beginning of sequence(e.g. "$<$s$>$" for Llama-2 and "$|$begin_of_text$|$" for Llama-3). Note that the following "" represents space.
  • ...and 15 more figures