Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling

Xiuying Wei, Yunchen Zhang, Yuhang Li, Xiangguo Zhang, Ruihao Gong, Jinyang Guo, Xianglong Liu

TL;DR

Post-training quantization of transformer models is hindered by channel-specific outliers that are both aggregated and asymmetric. OS+ introduces channel-wise shifting to align channel centers and channel-wise scaling to compress outlier ranges, plus a migration framework to preserve FP-equivalence and an efficient grid-search-based method for optimal factors. The approach delivers near-FP performance at 8- and 6-bit quantization, sets a new 4-bit BERT state-of-the-art, and generalizes to OPT, BLOOM, BLOOMZ, and LLaMA with modest computational overhead. This work enables practical static PTQ for large language models with minimal inference overhead and provides a foundation for further enhancements and broader applicability.

Abstract

Post-training quantization (PTQ) of transformer language models faces significant challenges due to detrimental outliers in activations. We observe that these outliers are concentrated in specific channels and are asymmetric across channels. To address this issue, we propose the Outlier Suppression+ (OS+) framework, which combines channel-wise shifting to handle asymmetry with channel-wise scaling to handle concentration. First, we show that these operations can be seamlessly migrated into subsequent modules while maintaining equivalence. Second, we propose a fast and stable scheme to calculate effective shifting and scaling values. The channel-wise shifting aligns the center of each channel to remove outlier asymmetry. The channel-wise scaling quantitatively evaluates the changes brought by migration and quantization to better balance the quantization burden. We validate OS+ under both standard and fine-grained quantization settings with models including BERT, OPT, BLOOM, BLOOMZ, and LLaMA. Comprehensive results across various tasks demonstrate the superiority of our approach. In particular, with standard quantization, OS+ achieves near-floating-point performance on both small models and large language models at 8-bit and 6-bit. In addition, we establish a new state of the art for 4-bit BERT with a 15.5% improvement. Our code is available at https://github.com/ModelTC/Outlier_Suppression_Plus.
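
The equivalence claim in the abstract can be checked numerically for the simplest case, a single linear layer that consumes the shifted and scaled activation. The following is a minimal NumPy sketch, not the authors' implementation: the helper name `migrate_into_linear`, the toy shapes, and the randomly chosen `z` and `s` are all assumptions made for illustration. It relies only on the identity X = X_tilde * s + z, which holds for any shift and any nonzero scale.

```python
import numpy as np


def migrate_into_linear(W, b, z, s):
    """Fold a per-channel shift z and scale s into the next linear layer.

    If X_tilde = (X - z) / s, then X = X_tilde * s + z, so
    X @ W + b == X_tilde @ (s[:, None] * W) + (z @ W + b).
    (Hypothetical helper; the name is not taken from the paper.)
    """
    W_new = s[:, None] * W   # scale each input row of W by its channel's s
    b_new = z @ W + b        # the shift becomes a constant folded into the bias
    return W_new, b_new


rng = np.random.default_rng(0)
tokens, d_in, d_out = 16, 8, 4

# Toy activation with two asymmetric outlier channels (assumed data).
X = rng.normal(size=(tokens, d_in))
X[:, 2] += 40.0
X[:, 5] -= 90.0
W = rng.normal(size=(d_in, d_out))
b = rng.normal(size=d_out)

# Arbitrary shift and positive scale: equivalence does not depend on
# how z and s were chosen.
z = rng.normal(size=d_in) * 10.0
s = rng.uniform(0.5, 20.0, size=d_in)

X_tilde = (X - z) / s
W_new, b_new = migrate_into_linear(W, b, z, s)

reference = X @ W + b                  # original layer on original input
migrated = X_tilde @ W_new + b_new     # transformed layer on transformed input
max_abs_diff = float(np.abs(reference - migrated).max())
print("max abs difference:", max_abs_diff)
```

The two outputs agree up to floating-point rounding, which is why the transformation can be applied before quantization without changing the full-precision model.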

Paper Structure (22 sections, 8 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: The original distribution shows asymmetric and aggregated outliers in certain channels, which produce a wide range of (-97, 43) for the whole tensor. Channel-wise shifting decreases the tensor range by eliminating the asymmetry. Channel-wise scaling then scales the outliers down to a threshold of 5, resulting in a distribution ranging from -5 to 5.
  • Figure 2: Algorithm flow. The left part gives two examples of the shifting and scaling transformation applied to a problematic activation and the modules that follow it. The right part describes how the optimal values of the two factors are calculated. In the left part, LN' denotes the LayerNorm with ${\bm{z}}$ and ${\bm{s}}$ absorbed, which outputs $\widetilde{{\bm{X}}}=({\bm{X}}-{\bm{z}})\oslash {\bm{s}}$. Specifically, (a) illustrates the weight transformation in multi-head attention with three linear layers, and (b) depicts the equivalent transformation with the residual connection quantized and taken after the LayerNorm. In the right part, ${\bm{z}}$ is first calculated to eliminate the asymmetry of the outliers. Then $t^{*}$, which denotes the shrink boundary, is found by grid search based on the metric shown to its left, which accounts for both quantization performance and the interaction between weight and activation. From $t^{*}$ we obtain the corresponding ${\bm{s}}$ that scales the outliers down to the boundary. The detailed algorithm is given in the paper's algorithm listing.
  • Figure 3: Averaged results of the quantized BERT model on the GLUE benchmark, where the accuracy of the FP model is 83.8. For detailed results, see the appendix on BERT experiments.
  • Figure 4: Latency comparison among our quantized model, the standard 8-bit model, and the FP16 model. It reports real latency (x-axis) on an A100 across different batch sizes (y-axis) and models, where BERT-large-256 denotes the large version of BERT with sequence length set to 256. Bold numbers indicate the speedup ratio brought by quantization. Note that our method introduces extra computation only on BERT models, and this overhead is negligible.
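
The procedure described in the captions of Figures 1 and 2 (center each channel, then grid-search a shrink boundary t and derive the scale from it) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: the shift is taken as the midpoint of each channel's minimum and maximum, the scale as max(1, max|x - z| / t), the quantizer is a simulated per-tensor uniform one, and the search metric is a plain mean squared error on a single linear layer's output. The function names, the toy data, and the candidate grid are all hypothetical.

```python
import numpy as np


def fake_quant(x, bits):
    """Simulated per-tensor asymmetric uniform quantization (quantize, then dequantize)."""
    lo, hi = float(x.min()), float(x.max())
    step = (hi - lo) / (2 ** bits - 1)
    if step == 0.0:
        return x.copy()
    return np.round((x - lo) / step) * step + lo


def shift_and_scale(X, t):
    """Per-channel shift z (range midpoint) and scale s that shrinks outliers to t.

    Both formulas are assumptions chosen to match the captions' description.
    """
    z = (X.max(axis=0) + X.min(axis=0)) / 2.0
    s = np.maximum(1.0, np.abs(X - z).max(axis=0) / t)
    return z, s


def search_threshold(X, W, b, candidates, bits):
    """Grid-search t by the quantized output error of one linear layer (assumed metric)."""
    reference = X @ W + b
    best_t, best_err = None, np.inf
    for t in candidates:
        z, s = shift_and_scale(X, t)
        X_q = fake_quant((X - z) / s, bits)      # quantized transformed activation
        W_q = fake_quant(s[:, None] * W, bits)   # quantized weight with s migrated in
        out = X_q @ W_q + (z @ W + b)            # shift migrated into the bias
        err = float(np.mean((out - reference) ** 2))
        if err < best_err:
            best_t, best_err = float(t), err
    return best_t, best_err


rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))
X[:, 2] = 10.0 * X[:, 2] + 40.0    # wide, positively offset outlier channel
X[:, 5] = 20.0 * X[:, 5] - 50.0    # wide, negatively offset outlier channel
W = rng.normal(size=(8, 4))
b = rng.normal(size=4)

z, s = shift_and_scale(X, t=5.0)
X_shifted = X - z
X_tilde = X_shifted / s
print("range before:", X.min(), X.max())
print("range after shift and scale:", X_tilde.min(), X_tilde.max())

candidates = np.linspace(1.0, 20.0, 20)
best_t, best_err = search_threshold(X, W, b, candidates, bits=6)
print("selected t:", best_t, "error:", best_err)
```

A smaller t compresses the activation more but pushes larger scale factors into the weight, so the search balances the quantization burden between the two; the error metric here only stands in for the paper's metric, which the Figure 2 caption describes as involving both quantization performance and the weight-activation interaction.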