RepQuant: Towards Accurate Post-Training Quantization of Large Transformer Models via Scale Reparameterization

Zhikai Li, Xuewen Liu, Jing Zhang, Qingyi Gu

TL;DR

RepQuant tackles post-training quantization (PTQ) of very large transformers by decoupling quantization from inference: complex quantizers are used during calibration and are then converted to hardware-friendly quantizers for inference via scale reparameterization. It targets the extreme activation distributions after LayerNorm and Softmax with channel-wise quantization and $\log_{\sqrt{2}}$ quantization, respectively, augmented by learnable per-channel dual clipping and integrated weight reconstruction. The approach delivers robust gains across vision, language, and multi-modal transformers, achieving near-full-precision performance at W6/A6 and substantial improvements at W4/A4 compared with prior PTQ methods. The framework is generic, scalable, and compatible with models such as ViT, OPT, LLaMA, CLIP, and SAM, enabling practical deployment of large transformers.

Abstract

Large transformer models have demonstrated remarkable success. Post-training quantization (PTQ), which requires only a small dataset for calibration and avoids end-to-end retraining, is a promising solution for compressing these large models. Regrettably, existing PTQ methods typically exhibit non-trivial performance loss. We find that the performance bottleneck stems from over-consideration of hardware compatibility in the quantization process, compelling them to reluctantly employ simple quantizers, albeit at the expense of accuracy. With the above insights, we propose RepQuant, a novel PTQ framework with quantization-inference decoupling paradigm to address the above issues. RepQuant employs complex quantizers in the quantization process and simplified quantizers in the inference process, and performs mathematically equivalent transformations between the two through quantization scale reparameterization, thus ensuring both accurate quantization and efficient inference. More specifically, we focus on two components with extreme distributions: LayerNorm activations and Softmax activations. Initially, we apply channel-wise quantization and $\log_{\sqrt{2}}$ quantization, respectively, which are tailored to their distributions. In particular, for the former, we introduce a learnable per-channel dual clipping scheme, which is designed to efficiently identify outliers in the unbalanced activations with fine granularity. Then, we reparameterize the scales to hardware-friendly layer-wise quantization and $\log_2$ quantization for inference. Moreover, quantized weight reconstruction is seamlessly integrated into the above procedure to further push the performance limits. Extensive experiments are performed on different large-scale transformer variants on multiple tasks, including vision, language, and multi-modal transformers, and RepQuant encouragingly demonstrates significant performance advantages.
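
The abstract states that channel-wise quantization of LayerNorm activations can be converted into layer-wise quantization through a mathematically equivalent transformation, but it does not spell out the algebra. The NumPy sketch below shows one way such an equivalence can hold. Everything specific in it is an assumption made for illustration: the min/max calibration, the choice of the mean as the layer-wise scale and zero point, the variation factors `r1` and `r2`, and all variable names. It folds `r1` and `r2` into the LayerNorm affine parameters and compensates in the following linear layer, then checks that the integer codes and the layer output are unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, D_out, bits = 32, 16, 8, 6
qmax = 2**bits - 1

# Hypothetical LayerNorm pieces: normalized input plus affine parameters
# whose magnitude varies strongly across channels.
x_hat = rng.standard_normal((N, D))
gamma = rng.uniform(0.1, 10.0, size=D)
beta = rng.uniform(-2.0, 2.0, size=D)
# Hypothetical linear layer that consumes the LayerNorm output.
W = rng.standard_normal((D, D_out))
b = rng.standard_normal(D_out)


def quantize(x, scale, zero):
    """Uniform asymmetric quantizer: integer codes in [0, qmax]."""
    return np.clip(np.round(x / scale) + zero, 0, qmax)


# Calibration side: channel-wise scale s and zero point z (simple min/max,
# an assumption; the paper uses a learnable dual clipping scheme instead).
x = x_hat * gamma + beta
x_min, x_max = x.min(axis=0), x.max(axis=0)
s = (x_max - x_min) / qmax
z = np.round(-x_min / s)

# Inference side: a single layer-wise scale and zero point (assumed: means).
s_t = s.mean()
z_t = np.round(z.mean())
r1 = s / s_t   # per-channel scale variation factor
r2 = z - z_t   # per-channel zero-point variation factor (integer valued)

# Fold the factors into LayerNorm's affine parameters ...
gamma_t = gamma / r1
beta_t = (beta + s * r2) / r1
# ... and undo them in the next layer's weights and bias.
W_t = r1[:, None] * W
b_t = b - (s * r2) @ W

x_t = x_hat * gamma_t + beta_t
codes_channel = quantize(x, s, z)        # channel-wise quantizer
codes_layer = quantize(x_t, s_t, z_t)    # layer-wise quantizer

y_ref = (s * (codes_channel - z)) @ W + b
y_rep = (s_t * (codes_layer - z_t)) @ W_t + b_t

print(np.array_equal(codes_channel, codes_layer), np.allclose(y_ref, y_rep))
```

In this sketch the per-channel information is not discarded; it moves into `gamma_t`, `beta_t`, `W_t`, and `b_t`, which are computed once offline, so the runtime quantizer only needs the scalar pair `s_t`, `z_t`.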

Paper Structure (23 sections, 15 equations, 11 figures, 10 tables, 1 algorithm)

Figures (11)

  • Figure 1: Comparison of different paradigms, with an example of quantization granularity for LayerNorm activations. Our proposed quantization-inference decoupling paradigm shows significant advantages.
  • Figure 2: Overview of the proposed RepQuant framework. Based on the quantization-inference decoupling paradigm, we initially apply channel-wise quantization for LayerNorm activations with severe inter-channel variations and $\log_{\sqrt{2}}$ quantization for Softmax activations with power-law characteristics in the quantization process, and then we simplify them to layer-wise quantization and $\log_2$ quantization via scale reparameterization, respectively, in the inference process, which can ensure both accurate quantization and efficient inference.
  • Figure 3: Analysis of performance bottlenecks of quantizing activations in DeiT-S. Evidently, the activations of LayerNorm and Softmax are the most significant obstacles that limit the quantization performance, posing great challenges for low-bit quantization.
  • Figure 4: Boxplots of different channels of the first module’s LayerNorm activations in DeiT-S and LLaMA-7B. The range varies significantly across channels: the widest channel range is tens of times larger than the narrowest.
  • Figure 5: Histograms of the first module’s Softmax activations in DeiT-S and LLaMA-7B. It can be clearly seen that the distributions are extremely unbalanced, with the vast majority concentrated on small values and a few dispersed on large values.
  • ...and 6 more figures
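
The Figure 2 caption mentions simplifying $\log_{\sqrt{2}}$ quantization of Softmax activations to $\log_2$ quantization via scale reparameterization. A minimal NumPy sketch of how that can work is given below. It relies on the identity $\sqrt{2}^{-q} = 2^{-q/2}$: for an even code $q$ this is already a power of two, and for an odd code the leftover factor $\sqrt{2}$ can be absorbed into the scale. The parity-dependent scale `s_t`, the fixed scale `s = 1.0`, and the variable names are assumptions made for illustration, not details taken from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
bits = 4
qmax = 2**bits - 1

# Hypothetical attention scores: row-wise softmax over random logits.
logits = rng.standard_normal((8, 16)) * 3.0
attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)

s = 1.0  # quantization scale (assumed to be 1 for softmax outputs in (0, 1])

# Calibration side: base-sqrt(2) log quantizer.
# log_{sqrt(2)}(x) equals 2 * log2(x).
q = np.clip(np.round(-2.0 * np.log2(attn / s)), 0, qmax).astype(np.int64)
deq_sqrt2 = s * np.sqrt(2.0) ** (-q.astype(np.float64))

# Inference side: power-of-two form with a parity-dependent scale.
odd = q % 2                                    # 1 where the code is odd
s_t = s * (odd * (np.sqrt(2.0) - 1.0) + 1.0)   # s for even, s * sqrt(2) for odd
shift = (q + 1) // 2                           # ceil(q / 2): right-shift amount
deq_log2 = s_t * 2.0 ** (-shift)

print(np.allclose(deq_sqrt2, deq_log2))
```

The finer base-$\sqrt{2}$ grid gives twice as many levels per octave as base 2, which suits distributions concentrated near zero, while the reparameterized form needs only a power-of-two factor (a bit shift on integer hardware) and a two-valued scale.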