Quantization Variation: A New Perspective on Training Transformers with Low-Bit Precision

Xijie Huang; Zhiqiang Shen; Pingcheng Dong; Kwang-Ting Cheng

TL;DR

This paper attributes the difficulty of low-bit quantization-aware training of transformers to their unique variation behaviors, which differ significantly from those of ConvNets, and proposes a variation-aware quantization scheme that alleviates the variation and improves the performance of transformers across various models and tasks.

Abstract

Despite the outstanding performance of transformers in both language and vision tasks, their growing computation and model size have increased the demand for efficient deployment. To reduce this computation and parameter cost, quantization is frequently studied as a representative model compression technique and has seen extensive use on ConvNets. However, due to the unique properties of transformers, low-bit quantization of these models remains limited and underexplored. In this paper, we attribute the difficulty of low-bit quantization-aware training (QAT) of transformers to their unique variation behaviors, which differ significantly from those of ConvNets. Based on comprehensive quantitative analysis, we observe variation in three hierarchies: differing quantization sensitivities across modules, outliers in the static weight and activation distributions, and oscillation in dynamic parameter fluctuations. These variations make QAT unstable and degrade performance. We explore best practices for alleviating the influence of variation during low-bit transformer QAT and propose a variation-aware quantization scheme for both vision and language transformers. Extensive experiments show that our scheme alleviates the variation and improves the performance of transformers across various models and tasks. Our solution substantially improves the 2-bit Swin-T and binary BERT-base, achieving accuracy improvements of 3.35% and 1.4% over previous state-of-the-art methods on ImageNet-1K and GLUE, respectively. Code and models are available at https://github.com/HuangOwen/Quantization-Variation.
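To make the oscillation mentioned in the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes a plain symmetric uniform quantizer with a fixed scale `alpha` (the paper's baseline, LSQ+, learns its quantizer parameters, which this sketch omits), and the helper names `quantize` and `count_flips` are hypothetical. The scale 0.0077 and the 4-bit setting are borrowed from the Figure 4 caption; the weight trajectory is synthetic.

```python
def quantize(w, alpha, bits):
    """Symmetric uniform quantizer (an assumption, not the paper's exact scheme).

    Returns the integer code and the dequantized value code * alpha.
    """
    qmin = -(2 ** (bits - 1))
    qmax = 2 ** (bits - 1) - 1
    code = max(qmin, min(qmax, round(w / alpha)))
    return code, code * alpha


def count_flips(latent_trajectory, alpha, bits):
    """Count how often the integer code changes between consecutive steps."""
    codes = [quantize(w, alpha, bits)[0] for w in latent_trajectory]
    return sum(1 for prev, cur in zip(codes, codes[1:]) if prev != cur)


alpha = 0.0077            # scale quoted in the Figure 4 caption
boundary = 1.5 * alpha    # decision boundary between integer codes 1 and 2

# Synthetic latent weight that jitters by +/- 1e-4 around the boundary,
# standing in for small gradient updates during training.
near_boundary = [boundary + (-1) ** t * 1e-4 for t in range(10)]

# Same jitter, but centered on a quantization level instead of a boundary.
near_center = [2.0 * alpha + (-1) ** t * 1e-4 for t in range(10)]

print(count_flips(near_boundary, alpha, bits=4))
print(count_flips(near_center, alpha, bits=4))
```

In this toy setting, the same tiny perturbation leaves the quantized weight unchanged when the latent weight sits near the center of a bin, but flips it at every step when the latent weight sits near a bin boundary. That sensitivity to position within a bin is the kind of instability the abstract refers to as oscillation.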

Paper Structure (25 sections, 12 equations, 7 figures, 15 tables)

Introduction
Related Work
Preliminaries
Understanding Quantization Variation of Transformers
Quantization Sensitivity
Distribution Outlier
Weight Oscillation in Training
Best Practices of Transformer Quantization
Module-dependent Quantization
Knowledge Distillation
Oscillation-aware Bin Regularization
Experiments
Experimental Settings
Comparison with State-of-the-Art Methods
Ablation Study
...and 10 more sections

Figures (7)

Figure 1: Left: ImageNet-1K Top-1 accuracy vs. BitOPs comparison of 2/3/4-bit quantized ViT models using LSQ+ (Bhalgat et al., 2020) and our method. Right: GLUE performance comparison of different binary (1-1-1-bit) BERT models.
Figure 2: An overview of the variation in transformers at different hierarchies: differing quantization sensitivities across modules, outliers in weight and activation distributions, and the oscillation phenomenon in dynamic parameter updates.
Figure 3: The accuracy degradation relative to the full-precision model when a specific head in a layer of the transformer is quantized. The label $h\text{-}l$ on the abscissa indicates that head $h$ in layer $l$ is quantized.
Figure 4: The weight distribution during QAT and the weight oscillation effect due to distribution variance. The layer we select is blocks.1.attn.proj-v.weight in 4-bit quantized DeiT-S with scale $\alpha=0.0077$.
Figure 5: Loss landscape visualization of the 4-bit quantized Swin-T using the baseline (LSQ+ quantization) method and our module-dependent quantization method.
...and 2 more figures
