Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference

Ke Yi, Zengke Liu, Jianwei Zhang, Chengyuan Li, Tong Zhang, Junyang Lin, Jingren Zhou

TL;DR

Rotated Runtime Smooth (RRS) is a plug-and-play activation smoother for quantization that combines Runtime Smooth with a rotation operation. Runtime Smooth eliminates channel-wise outliers by smoothing activations with channel-wise maximums during runtime.

Abstract

Large language models have demonstrated promising capabilities upon scaling up parameters. However, serving large language models incurs substantial computation and memory movement costs due to their large scale. Quantization methods have been employed to reduce service costs and latency. Nevertheless, outliers in activations hinder the development of INT4 weight-activation quantization. Existing approaches separate outliers and normal values into two matrices or migrate outliers from activations to weights, suffering from high latency or accuracy degradation. Based on observing activations from large language models, outliers can be classified into channel-wise and spike outliers. In this work, we propose Rotated Runtime Smooth (RRS), a plug-and-play activation smoother for quantization, consisting of Runtime Smooth and the Rotation operation. Runtime Smooth (RS) is introduced to eliminate channel-wise outliers by smoothing activations with channel-wise maximums during runtime. The rotation operation can narrow the gap between spike outliers and normal values, alleviating the effect of victims caused by channel-wise smoothing. The proposed method outperforms the state-of-the-art method in the LLaMA and Qwen families and improves WikiText-2 perplexity from 57.33 to 6.66 for INT4 inference.
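To make the Runtime Smooth idea concrete, here is a minimal NumPy sketch under stated assumptions; it is not the authors' kernel. It follows the abstract (divide activations by channel-wise maximums computed at runtime) and the three steps in the Figure 4 caption (reorder, group, apply the group scale to each block's interim result). The function names, the `group_size` parameter, the symmetric per-token fake quantizer, the synthetic data, and the choice to keep weights in floating point (to isolate the activation effect) are all illustrative assumptions.

```python
import numpy as np

def fake_quant_int4_per_token(x):
    """Symmetric per-token (per-row) fake quantization to the INT4 range [-8, 7]."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)       # avoid division by zero
    return np.clip(np.round(x / scale), -8, 7) * scale  # dequantized values

def runtime_smooth_matmul(x, w, group_size=4):
    """Hypothetical sketch. x: (tokens, channels), w: (out_features, channels).

    Assumes channels is divisible by group_size; weights stay in floating point.
    """
    ch_max = np.abs(x).max(axis=0)                   # channel-wise maximums at runtime
    order = np.argsort(ch_max)                       # (1) reorder channels by maximum
    x, w, ch_max = x[:, order], w[:, order], ch_max[order]
    n_groups = x.shape[1] // group_size
    # (2) one smoothing scale per group: the largest channel maximum in the group
    s = ch_max.reshape(n_groups, group_size).max(axis=1)
    s = np.where(s == 0.0, 1.0, s)
    x_q = fake_quant_int4_per_token(x / np.repeat(s, group_size))
    # (3) block-wise matmul; each block's interim result is rescaled by its scale
    y = np.zeros((x.shape[0], w.shape[0]))
    for g in range(n_groups):
        cols = slice(g * group_size, (g + 1) * group_size)
        y += s[g] * (x_q[:, cols] @ w[:, cols].T)
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 256))
x[:, 5] *= 20.0                                      # synthetic channel-wise outlier
w = rng.standard_normal((16, 256))
y_ref = x @ w.T

def rel_err(y):
    return np.linalg.norm(y - y_ref) / np.linalg.norm(y_ref)

err_plain = rel_err(fake_quant_int4_per_token(x) @ w.T)
err_rs = rel_err(runtime_smooth_matmul(x, w))
print(f"plain per-token INT4: {err_plain:.3f}, Runtime Smooth: {err_rs:.3f}")
```

Because the same permutation is applied to the channels of both operands and each group scale is multiplied back after the block product, the result equals the unquantized product up to quantization error. Setting `group_size=1` gives the purely channel-wise variant described in the abstract.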

Paper Structure (23 sections, 6 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Challenges that SmoothQuant faces with outliers. (a) An unmatched $s$ is useless for smoothing. (b) Under the migration scheme, the smoothed activations and weights are still hard to quantize to 4 bits. (c) Normal values are pruned as victims after smoothing because of the spike outlier.
  • Figure 2: Review of the rotation-based method. (a) illustrates a simple implementation of the rotation-based method. The output of the projector is unchanged, since $\mathbf{Y} = (\mathbf{X}\mathbf{R})(\mathbf{R}^{-1} \mathbf{W}^{T}) = \mathbf{X}\mathbf{W}^{T}$. (b) explains the success of the rotation-based method on LLMs, where activations have a high probability of being smoothed after rotation compared with a random matrix. (c) illustrates that an activation with channel-wise outliers still maintains channel-wise consistency after rotation, leaving room for further smoothing.
  • Figure 3: Preliminary ablation study
  • Figure 4: Pipeline of Runtime Smooth. (1) Reorder activations and weights according to the channel-wise maximums of the activations. (2) Group activations according to the block size of the matrix multiplication. The maximum of each group is used as the runtime smoothing scale of that group. (3) In the matrix multiplication pipeline, quantized smoothed activations and weights are segmented into blocks. The block size equals the group size from the previous step. Within a block, tiled smoothed activations are multiplied by tiled quantized weights. The runtime smoothing scales are applied to the dequantized interim result.
  • Figure 5: Analysis of rotated activations with different outliers. (a) An activation with channel-wise outliers maintains channel-wise consistency after rotation, and is therefore sub-smooth for quantization. (b) A single spike outlier is spread across its own token, so the smoothing scale stays consistent, with no abnormal values, which further prevents 'victims'.
  • ...and 4 more figures
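
The identity in the Figure 2(a) caption and the spreading effect in the Figure 5(b) caption can be checked numerically. The sketch below assumes a normalized Hadamard matrix as the rotation $\mathbf{R}$; that is one common orthogonal choice, and the text here does not state which rotation is used. The helper `hadamard`, the matrix sizes, and the spike value are illustrative.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

n = 8
R = hadamard(n) / np.sqrt(n)      # orthogonal, so R^{-1} equals R.T

rng = np.random.default_rng(0)
X = rng.standard_normal((4, n))   # activations: (tokens, channels)
W = rng.standard_normal((3, n))   # weights: (out_features, channels)

# Figure 2(a): (X R)(R^{-1} W^T) = X W^T, so the layer output is unchanged.
Y_rot = (X @ R) @ (R.T @ W.T)
print(np.allclose(Y_rot, X @ W.T))   # True

# Figure 5(b): a single spike outlier is spread evenly over its token.
spike = np.zeros((1, n))
spike[0, 3] = 40.0
rotated = spike @ R
print(np.abs(rotated))               # every entry has magnitude 40 / sqrt(8)
```

After rotation the spike no longer dominates one channel: its energy is shared equally by all channels of that token, which is why a channel-wise smoothing scale is not distorted by it.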