QuantTune: Optimizing Model Quantization with Adaptive Outlier-Driven Fine Tuning

Jiun-Man Chen, Yu-Hsuan Chao, Yu-Jie Wang, Ming-Der Shieh, Chih-Chung Hsu, Wei-Fen Lin

TL;DR

QuantTune is a quantization-friendly fine-tuning method that uses a proposed outlier-driven loss to restrict the dynamic range amplification effect of outliers in Transformer-based models, achieving significant improvements across these models with only naive post-training quantization (PTQ).
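
To make the "dynamic range amplification effect" concrete, here is a minimal sketch assuming naive per-tensor symmetric 8-bit quantization with max calibration (an assumption; the exact PTQ setup is not given here). The activations are synthetic, and the helpers `quantize_symmetric` and `mean_squared_error` are hypothetical. A single outlier sets the scale for the whole tensor, so the rounding (precision-loss) error on all the ordinary values grows.

```python
import random


def quantize_symmetric(values, num_bits=8):
    """Naive per-tensor symmetric linear quantization with max calibration.

    The scale comes from the largest |value|, so nothing is clipped
    (no saturation error); the only error left is rounding (precision loss).
    Returns the dequantized values and the scale.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax
    dequant = [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]
    return dequant, scale


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
# Synthetic activations: a well-behaved Gaussian bulk...
activations = [random.gauss(0.0, 1.0) for _ in range(4096)]
# ...plus a single large outlier appended to the same tensor.
with_outlier = activations + [60.0]

q_clean, scale_clean = quantize_symmetric(activations)
q_outlier, scale_outlier = quantize_symmetric(with_outlier)

# Compare error on the ordinary values only (the outlier itself is excluded).
err_clean = mean_squared_error(activations, q_clean)
err_outlier = mean_squared_error(activations, q_outlier[:len(activations)])

print(f"scale: {scale_clean:.4f} without outlier, {scale_outlier:.4f} with outlier")
print(f"MSE on ordinary values: {err_clean:.2e} vs {err_outlier:.2e}")
```

With the outlier present, the quantization step grows sharply, so many ordinary activations collapse onto a few quantization levels even though none of them is clipped.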

Abstract

Transformer-based models have gained widespread popularity in both computer vision (CV) and natural language processing (NLP). However, post-training linear quantization poses significant challenges for these models, leading to noticeable drops in inference accuracy. Our study uncovers the underlying causes of these accuracy drops and proposes a quantization-friendly fine-tuning method, QuantTune. First, our analysis reveals that, on average, 65% of quantization errors result from the precision loss incurred by the dynamic range amplification effect of outliers across the target Transformer-based models. Second, QuantTune adjusts weights based on the deviation of outlier activations, effectively constraining the dynamic ranges of the problematic activations. As a result, it mitigates the negative impact of outliers on the inference accuracy of quantized models. Finally, QuantTune integrates seamlessly into the back-propagation pass of fine-tuning without adding complexity to inference software or hardware design. Our approach delivers significant improvements in post-training quantization across a range of Transformer-based models, including ViT, BERT-base, and OPT. Compared with top calibration methods, QuantTune reduces accuracy drops by 12.09% at 8-bit quantization and by 33.8% at 7-bit, outperforming state-of-the-art solutions by over 18.84% across ViT models.
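
The abstract does not give the form of the outlier-driven loss, so the sketch below is a hypothetical stand-in, not the paper's formulation. It assumes a penalty on the squared amount by which each activation exceeds k standard deviations from the tensor mean (k = 3 is an assumption) and backpropagates that penalty through a toy per-channel layer to adjust its weights. In practice such a term would be added to the task loss with some weighting factor; the task loss is omitted here to isolate the effect. The names `outlier_penalty` and `layer_forward` are illustrative only.

```python
import statistics


def outlier_penalty(acts, k=3.0):
    """Hypothetical outlier-driven penalty (not the paper's formula).

    Sums the squared excess of each activation beyond k standard deviations
    from the tensor mean. Mean and std are treated as constants, so the
    returned gradient only flows through the activations themselves.
    """
    mu = statistics.fmean(acts)
    bound = k * statistics.pstdev(acts)
    loss, grads = 0.0, []
    for a in acts:
        excess = abs(a - mu) - bound
        if excess > 0:
            loss += excess ** 2
            grads.append(2.0 * excess * (1.0 if a > mu else -1.0))
        else:
            grads.append(0.0)
    return loss, grads


def layer_forward(weights, inputs):
    """Toy per-channel layer (hypothetical): act_i = weight_i * input_i."""
    return [w * x for w, x in zip(weights, inputs)]


# Hypothetical setup: 31 well-behaved channels plus one outlier channel.
inputs = [1.0] * 32
weights = [0.5 + i / 30 for i in range(31)] + [20.0]

initial_acts = layer_forward(weights, inputs)
initial_penalty, _ = outlier_penalty(initial_acts)

lr = 0.1
for _ in range(200):
    _, grad_acts = outlier_penalty(layer_forward(weights, inputs))
    # Chain rule through the toy layer: d(act_i)/d(weight_i) = input_i.
    weights = [w - lr * g * x for w, g, x in zip(weights, grad_acts, inputs)]

final_acts = layer_forward(weights, inputs)
final_penalty, _ = outlier_penalty(final_acts)
print(f"penalty: {initial_penalty:.2f} -> {final_penalty:.2e}")
print(f"max |act|: {max(map(abs, initial_acts)):.2f} -> {max(map(abs, final_acts)):.2f}")
```

Only the outlier channel receives a non-zero gradient, so the well-behaved channels are left untouched while the dynamic range of the tensor shrinks. Treating the mean and standard deviation as constants keeps the gradient simple; an automatic-differentiation framework would also propagate gradients through them, which changes the update.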

Paper Structure

This paper contains 19 sections, 3 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Flowchart of the proposed QuantTune method, highlighting the use of the activation observer to compute the outlier-driven loss, which mitigates outliers and reduces the dynamic range. The red line marks where the outlier observer is inserted.
  • Figure 2: Comparative analysis of activation distributions across different Transformer models. Boxplots show activation value ranges over grouped channels for ViT (base), ViT (small), DeiT (base), Swin (base), BERT (base), and OPT (350m). Color denotes the activation value range, with red indicating the widest range. Data were divided into 30 groups for consistent comparison.
  • Figure 3: Accuracy and quantization error versus saturation threshold for ViT-related models (left: ViT-base, middle: DeiT-tiny, right: Swin-tiny). The line chart shows ImageNet-1K accuracy at each saturation percentage. The bar chart shows the two forms of quantization error: saturation error (blue bars) and precision loss error (red bars).
  • Figure 4: Precision loss error and dynamic range of each block in the ViT-Base model. The red line shows the dynamic range before saturation, and the blue line shows it after saturation. The bars show the relative precision loss error, computed as the sum of the KL divergences between the full-precision and quantized tensors.
  • Figure 5: Performance of Transformer-based models under different calibration methods. The bar chart compares top-1 ImageNet-1K accuracy for various ViT architectures under each calibration method.
  • ...and 2 more figures
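
Figures 3 and 4 separate quantization error into saturation error and precision loss. The sketch below illustrates that trade-off under its own assumptions: synthetic heavy-tailed data, symmetric 8-bit quantization with a percentile-based clipping threshold, and squared error as the metric (Figure 4 uses KL divergence instead). It reports errors only, not model accuracy, and the helper names `split_quantization_error` and `percentile_of_abs` are hypothetical.

```python
import math
import random


def split_quantization_error(values, clip, num_bits=8):
    """Quantize symmetrically with threshold `clip` and split the squared error
    into saturation error (|v| > clip, values clamped to +/-clip) and precision
    loss (rounding error of values inside the representable range)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = clip / qmax
    saturation, precision = 0.0, 0.0
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale))) * scale
        if abs(v) > clip:
            saturation += (v - q) ** 2
        else:
            precision += (v - q) ** 2
    return saturation, precision


def percentile_of_abs(values, pct):
    """Nearest-rank percentile of |values|, with pct in (0, 100]."""
    ordered = sorted(abs(v) for v in values)
    rank = min(len(ordered), max(1, math.ceil(pct / 100 * len(ordered))))
    return ordered[rank - 1]


random.seed(0)
# Hypothetical heavy-tailed activations: a Gaussian bulk plus three outliers.
acts = [random.gauss(0.0, 1.0) for _ in range(10_000)] + [40.0, -35.0, 50.0]

for pct in (100.0, 99.99, 99.9, 99.0):
    clip = percentile_of_abs(acts, pct)
    sat, prec = split_quantization_error(acts, clip)
    print(f"clip at {pct:6.2f}th pct ({clip:6.2f}): "
          f"saturation {sat:10.3f}, precision loss {prec:8.3f}")
```

Clipping at the maximum gives zero saturation error but a coarse quantization step; lowering the threshold shrinks the step and thus the precision loss, at the cost of growing saturation error on the clipped values.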