PrefixQuant: Eliminating Outliers by Prefixed Tokens for Large Language Models Quantization
Mengzhao Chen, Yi Liu, Jiahao Wang, Yi Bin, Wenqi Shao, Ping Luo
TL;DR
PrefixQuant tackles token-wise outliers in LLM activations by prefixing high-frequency outlier tokens in the KV cache, confinement of outliers to prefixed positions, and offline prefilling, enabling efficient, training-free outlier handling plus block-wise fine-tuning with trainable quantizers. The approach yields state-of-the-art or competitive accuracy across dynamic and static quantization settings (W4A4KV4, W4A8KV4) and multiple model families, while delivering substantial end-to-end speedups in prefilling and decoding. Through extensive ablations and comparisons, PrefixQuant demonstrates the effectiveness of prefixed tokens, content-aware prefixed selection, and per-block optimization in reducing quantization error and maintaining performance. The method is compatible with existing quantization schemes and shows broad practical impact for deploying LLMs with lower precision on real hardware.
Abstract
Existing weight-activation quantization methods for Large Language Models (LLMs) primarily address channel-wise outliers but often neglect token-wise outliers, which limits the accuracy of quantized models. In this work, we propose PrefixQuant, a novel quantization method that achieves state-of-the-art performance across various precision levels (W4A4KV4 and W4A8KV4) and granularities (dynamic and static quantization) by effectively isolating token-wise outliers. First, PrefixQuant eliminates token-wise outliers by prefixing outlier tokens in the KV cache, a process that is training-free and highly efficient (e.g., 1 minutes for Llama-3-70B). Second, PrefixQuant introduces new trainable parameters for block-wise training to compensate for quantization error. Our experiments show that PrefixQuant significantly outperforms existing dynamic quantization methods, even under coarser static quantization settings. For instance, PrefixQuant achieves an average accuracy improvement of +3.08 and +2.85 points over SpinQuant (dynamic quantization) on five zero-shot reasoning tasks under dynamic and static quantization settings, respectively, on W4A4KV4 Llama-3-8B. Additionally, we demonstrate up to 2.74x prefilling speedup and 2.16x decoding speedup for LLMs using W4A4 PrefixQuant. Our code is available at https://github.com/ChenMnZ/PrefixQuant.
