Quantization Variation: A New Perspective on Training Transformers with Low-Bit Precision

Xijie Huang, Zhiqiang Shen, Pingcheng Dong, Kwang-Ting Cheng

TL;DR

This paper attributes the difficulty of low-bit quantization-aware training of transformers to their unique variation behaviors, which differ significantly from those of ConvNets, and proposes a variation-aware quantization scheme that alleviates this variation and improves the performance of transformers across various models and tasks.

Abstract

Despite the outstanding performance of transformers in both language and vision tasks, their growing computation and model size have increased the demand for efficient deployment. To address the heavy computation and parameter costs, quantization is frequently studied as a representative model compression technique and has seen extensive use on ConvNets. However, due to the unique properties of transformers, low-bit quantization applications remain limited and underexplored. In this paper, we attribute the difficulty of low-bit quantization-aware training (QAT) of transformers to their unique variation behaviors, which differ significantly from those of ConvNets. Based on comprehensive quantitative analysis, we observe variation in three hierarchies: differing quantization sensitivities across modules, outliers in static weight and activation distributions, and oscillation in dynamic parameter fluctuations. These variations make QAT unstable and degrade performance. We explore best practices for alleviating the influence of variation during low-bit transformer QAT and propose a variation-aware quantization scheme for both vision and language transformers. We extensively verify that our scheme alleviates the variation and improves the performance of transformers across various models and tasks. Our solution substantially improves the 2-bit Swin-T and binary BERT-base, achieving accuracy improvements of 3.35% and 1.4% over previous state-of-the-art methods on ImageNet-1K and GLUE, respectively. Code and models are available at https://github.com/HuangOwen/Quantization-Variation.

Paper Structure

This paper contains 25 sections, 12 equations, 7 figures, and 15 tables.

Figures (7)

  • Figure 1: Left: ImageNet-1K Top-1 accuracy vs. BitOPs for 2/3/4-bit quantized ViT models using LSQ+ (Bhalgat et al., 2020) and our method. Right: GLUE performance comparison of different binary (1-1-1-bit) BERT models.
  • Figure 2: An overview of the variation in transformers at different hierarchies: differing quantization sensitivities across modules, outliers in weight and activation distributions, and the oscillation phenomenon in dynamic parameter updates.
  • Figure 3: The accuracy degradation relative to the full-precision model when a specific head in one transformer layer is quantized. The label $h\text{-}l$ on the abscissa indicates that head $h$ in layer $l$ is quantized.
  • Figure 4: The weight distribution during QAT and the weight oscillation effect due to distribution variance. The layer we select is blocks.1.attn.proj-v.weight in 4-bit quantized DeiT-S with scale $\alpha=0.0077$.
  • Figure 5: Loss landscape visualization of the 4-bit quantized Swin-T using the baseline (LSQ+ quantization) method and our module-dependent quantization method.
  • ...and 2 more figures
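
The weight oscillation mentioned in the Figure 4 caption can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a generic symmetric uniform quantizer (the helper `fake_quantize` is a hypothetical name) and a made-up latent weight trajectory. It reuses only the 4-bit setting and the scale $\alpha=0.0077$ quoted in the caption. It shows how tiny updates to a latent weight sitting near a rounding boundary can flip its quantized value by a full step.

```python
import numpy as np

def fake_quantize(w, alpha, bits=4):
    """Symmetric uniform quantizer: round w/alpha to an integer grid, clip, rescale.

    Illustrative helper (hypothetical name), not taken from the paper's code.
    """
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return alpha * np.clip(np.round(w / alpha), qmin, qmax)

alpha = 0.0077           # scale quoted in the Figure 4 caption
boundary = 1.5 * alpha   # rounding boundary between integer levels 1 and 2

# Assumed latent weight trajectory: small alternating updates of +/-0.0002
# around the boundary, standing in for noisy gradient steps during QAT.
steps = np.arange(8)
latent = boundary + 0.0002 * np.where(steps % 2 == 0, 1.0, -1.0)

quantized = fake_quantize(latent, alpha)
levels = np.round(quantized / alpha).astype(int)

print("latent   :", np.round(latent, 5))
print("quantized:", np.round(quantized, 5))
print("levels   :", levels)
```

In this toy trajectory the latent weight moves by only 0.0004 between steps, while its quantized value jumps by a full step of $\alpha$ each time, alternating between integer levels 1 and 2.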