Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization

Aniruddha Nrusimha, Mayank Mishra, Naigang Wang, Dan Alistarh, Rameswar Panda, Yoon Kim

TL;DR

This paper addresses the challenge of 4-bit activation quantization in large language models by studying outlier channels that emerge early in pretraining, especially in residual-stream paths. It proposes a simple activation-regularization strategy: use quantization-aware training with learned input clipping values and add kurtosis regularization on layer outputs to suppress heavy-tailed distributions, mitigating the migration of quantization difficulty to weights. The combined approach enables a $W4A4$ model that approaches the standard $W16A16$ perplexity on moderate-scale LLMs trained on 20B tokens, with weight PTQ remaining feasible via RTN or GPTQ, and shows the importance of early interventions over post-hoc methods. This work demonstrates that activation regularization can substantially improve 4-bit LLM quantization, offering practical gains in memory efficiency and inference speed on hardware with native low-bitwidth support.

Abstract

We consider the problem of accurate quantization for language models, where both the weights and activations are uniformly quantized to 4 bits per parameter, the lowest bitwidth format natively supported by GPU hardware. In this context, the key challenge is activation quantization: it is known that language models contain outlier channels whose values on average are orders of magnitude higher than other channels, which prevents accurate low-bitwidth quantization with known techniques. We systematically study this phenomenon and find that these outlier channels emerge early in training, and that they occur more frequently in layers with residual streams. We then propose a simple strategy which regularizes a layer's inputs via quantization-aware training (QAT) and its outputs via activation kurtosis regularization. We show that regularizing both the inputs and outputs is crucial for preventing a model from "migrating" the difficulty in input quantization to the weights, which makes post-training quantization (PTQ) of weights more difficult. When combined with weight PTQ, we show that our approach can obtain a W4A4 model that performs competitively with the standard-precision W16A16 baseline.
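
To make the two ingredients named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes symmetric uniform quantization with a single clip value (fixed here, whereas the paper learns it during QAT) and uses the standard fourth-moment definition of kurtosis; the exact regularization loss used in the paper is not given in this section. The function names `quantize_uniform`, `kurtosis`, and `mse`, and the toy activation values, are hypothetical.

```python
def quantize_uniform(xs, clip, bits=4):
    """Clip xs to [-clip, clip], then round onto a symmetric uniform grid.

    Assumption: signed integer grid with 2**(bits-1) - 1 positive levels
    (7 for 4 bits). In QAT the clip value would be a learned parameter.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = clip / qmax
    out = []
    for x in xs:
        x_clipped = max(-clip, min(clip, x))
        out.append(round(x_clipped / scale) * scale)
    return out


def kurtosis(xs):
    """Fourth standardized moment; large values indicate heavy tails."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (var ** 2)


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy activations (hypothetical values): one tensor without and one with
# a single outlier entry.
well_behaved = [-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0, 0.75]
with_outlier = well_behaved[:-1] + [50.0]

# Without clipping below the maximum, the outlier stretches the 4-bit grid
# so far that every small value rounds to zero.
clip_a = max(abs(x) for x in well_behaved)
clip_b = max(abs(x) for x in with_outlier)
err_small = mse(well_behaved[:-1], quantize_uniform(well_behaved, clip_a)[:-1])
err_large = mse(with_outlier[:-1], quantize_uniform(with_outlier, clip_b)[:-1])

print("kurtosis without outlier:", kurtosis(well_behaved))
print("kurtosis with outlier:   ", kurtosis(with_outlier))
print("error on small values (no outlier):  ", err_small)
print("error on small values (with outlier):", err_large)
```

In this toy setting the outlier raises both the kurtosis and the quantization error on the remaining values. That is the intuition behind penalizing kurtosis on a layer's outputs, which become the inputs that the next layer must quantize.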

Paper Structure

This paper contains 23 sections, 5 equations, 6 figures, 4 tables, and 2 algorithms.

Figures (6)

  • Figure 1: (Top) Average of the absolute activation values of a KV projection layer for a 1B language model trained with (a) standard training, (b) QAT with learned clipping values in the input layer, and (c) QAT on the inputs and kurtosis regularization on the layer's outputs. For the QAT runs, we show the learned clip value as a green 2D manifold. (Bottom) Parameter values of individual weights in the KV projection of the same layer corresponding to each model after training. QAT-only training results in the model's weights becoming harder to quantize, whereas kurtosis regularization mitigates this.
  • Figure 2: Frequency of outlier channels over the course of training. (Left) Proportion of outlier channels by layer depth. Layer 1 has the highest occurrence of outlier channels. (Middle) In layer 1, inputs to the attention projection layer have the most outlier channels. (Right) This is generally not the case for the other layers, where the input to the QKV projection layer has the most outlier channels.
  • Figure 3: Trajectory of a channel's activations across 50B tokens of training. We show each channel's absolute activation value averaged across 500K tokens.
  • Figure 4: The distribution of activations of a non-outlier channel (left) and two outlier channels (middle, right) over training.
  • Figure 5: Activation development in two open-source models: Pythia 6.9B (Biderman et al., 2023) and OLMo 7B (Groeneveld et al., 2024). We show activations for a layer that reads from the residual stream (QKV Input) and one that does not (MLP Proj Input). Note that the OLMo data includes a step-0 checkpoint (i.e., at initialization).
  • ...and 1 more figure