Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization
Aniruddha Nrusimha, Mayank Mishra, Naigang Wang, Dan Alistarh, Rameswar Panda, Yoon Kim
TL;DR
This paper addresses the challenge of 4-bit activation quantization in large language models by studying outlier channels that emerge early in pretraining, especially in residual-stream paths. It proposes a simple activation-regularization strategy: use quantization-aware training with learned input clipping values and add kurtosis regularization on layer outputs to suppress heavy-tailed distributions, mitigating the migration of quantization difficulty to weights. The combined approach enables a $W4A4$ model that approaches the standard $W16A16$ perplexity on moderate-scale LLMs trained on 20B tokens, with weight PTQ remaining feasible via RTN or GPTQ, and shows the importance of early interventions over post-hoc methods. This work demonstrates that activation regularization can substantially improve 4-bit LLM quantization, offering practical gains in memory efficiency and inference speed on hardware with native low-bitwidth support.
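To make the kurtosis term concrete, the following is a minimal NumPy sketch of a kurtosis-based penalty on layer outputs. It is an illustration under stated assumptions, not the authors' implementation: the function names `kurtosis` and `kurtosis_penalty`, the `weight` parameter, and the choice to sum the plain fourth standardized moment over layers are all hypothetical, and the paper's exact formulation may differ.

```python
import numpy as np

def kurtosis(x, eps=1e-8):
    """Fourth standardized moment of all entries of x (about 3 for Gaussian data)."""
    x = np.asarray(x, dtype=np.float64).ravel()
    centered = x - x.mean()
    var = np.mean(centered ** 2)
    return np.mean(centered ** 4) / (var ** 2 + eps)

def kurtosis_penalty(layer_outputs, weight=1.0):
    """Hypothetical auxiliary loss: weighted sum of kurtosis over layer outputs."""
    return weight * sum(kurtosis(y) for y in layer_outputs)

rng = np.random.default_rng(0)

# Well-behaved activations: 256 tokens x 64 channels, roughly Gaussian.
gaussian = rng.normal(size=(256, 64))

# Same activations with one artificial outlier channel (assumed 100x scale).
heavy = gaussian.copy()
heavy[:, 3] *= 100.0

k_gauss = kurtosis(gaussian)   # close to 3
k_heavy = kurtosis(heavy)      # much larger: the distribution is heavy-tailed
penalty = kurtosis_penalty([gaussian, heavy], weight=0.1)
```

A single outlier channel raises the kurtosis of the whole activation tensor sharply, which is why penalizing kurtosis during training discourages such channels from forming.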
Abstract
We consider the problem of accurate quantization for language models, where both the weights and activations are uniformly quantized to 4 bits per parameter, the lowest bitwidth format natively supported by GPU hardware. In this context, the key challenge is activation quantization: it is known that language models contain outlier channels whose values on average are orders of magnitude higher than those of other channels, which prevents accurate low-bitwidth quantization with known techniques. We systematically study this phenomenon and find that these outlier channels emerge early in training, and that they occur more frequently in layers with residual streams. We then propose a simple strategy that regularizes a layer's inputs via quantization-aware training (QAT) and its outputs via activation kurtosis regularization. We show that regularizing both the inputs and outputs is crucial for preventing a model from "migrating" the difficulty in input quantization to the weights, which makes post-training quantization (PTQ) of weights more difficult. When combined with weight PTQ, we show that our approach can obtain a W4A4 model that performs competitively with the standard-precision W16A16 baseline.
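The effect of an outlier channel on uniform 4-bit activation quantization can be illustrated with a small NumPy sketch. This is a toy example under stated assumptions rather than the paper's method: `quantize_uniform` is a hypothetical helper, the clipping value is simply set to the largest absolute activation (in QAT it would instead be a learned parameter), and the 100x outlier scale is chosen arbitrarily for illustration.

```python
import numpy as np

def quantize_uniform(x, clip, bits=4):
    """Symmetric uniform fake-quantization: scale by clip, round to the
    signed integer grid [-2^(bits-1), 2^(bits-1) - 1], then rescale."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4 bits
    scale = clip / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def rel_error_on_normal_channels(x, outlier_idx=5):
    """Relative quantization error measured on all channels except outlier_idx."""
    clip = np.abs(x).max()              # stand-in for a learned clipping value
    xq = quantize_uniform(x, clip)
    mask = np.ones(x.shape[1], dtype=bool)
    mask[outlier_idx] = False
    return np.linalg.norm((x - xq)[:, mask]) / np.linalg.norm(x[:, mask])

rng = np.random.default_rng(0)
acts = rng.normal(size=(512, 64))       # 512 tokens x 64 channels

outlier = acts.copy()
outlier[:, 5] *= 100.0                  # one artificial outlier channel

err_clean = rel_error_on_normal_channels(acts)
err_outlier = rel_error_on_normal_channels(outlier)
```

With the outlier present, the quantization step becomes so coarse that every value in the ordinary channels rounds to zero, so their information is lost entirely. Lowering the clipping value would recover those channels at the cost of saturating the outlier channel, which is the trade-off that motivates regularizing activations during training.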
