EfQAT: An Efficient Framework for Quantization-Aware Training

Saleh Ashkboos, Bram Verhoef, Torsten Hoefler, Evangelos Eleftheriou, Martino Dazzi

TL;DR

EfQAT is significantly more accurate than PTQ with little extra compute, and it accelerates the QAT backward pass by 1.44-1.64x while retaining most of the accuracy.

Abstract

Quantization-aware training (QAT) schemes have been shown to achieve near-full-precision accuracy. They accomplish this by training a quantized model for multiple epochs. This is computationally expensive, mainly because of the full-precision backward pass. On the other hand, post-training quantization (PTQ) schemes do not involve training and are therefore computationally cheap, but they usually result in a significant accuracy drop. We address these challenges by proposing EfQAT, which generalizes both schemes by optimizing only a subset of the parameters of a quantized model. EfQAT starts by applying a PTQ scheme to a pre-trained model and only updates the most critical network parameters while freezing the rest, accelerating the backward pass. We demonstrate the effectiveness of EfQAT on various CNNs and Transformer-based models using different GPUs. Specifically, we show that EfQAT is significantly more accurate than PTQ with little extra compute. Furthermore, EfQAT can accelerate the QAT backward pass by 1.44-1.64x while retaining most of the accuracy.
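
The selective update described in the abstract can be illustrated with a minimal sketch. It assumes that a parameter group is one row of a weight matrix, that importance is the mean absolute value of the row (the criterion named in the Figure 1 caption), and that the optimizer is plain SGD. The function names `row_importance`, `select_unfrozen_rows`, and `sgd_step_unfrozen` and the `fraction` parameter are hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
def row_importance(weights):
    """Mean absolute value of each row (assumed: one row per output channel)."""
    return [sum(abs(w) for w in row) / len(row) for row in weights]


def select_unfrozen_rows(weights, fraction):
    """Indices of the top `fraction` of rows by importance (at least one row)."""
    scores = row_importance(weights)
    k = max(1, round(fraction * len(weights)))
    ranked = sorted(range(len(weights)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sgd_step_unfrozen(weights, grads, unfrozen, lr):
    """Apply a plain SGD step to the unfrozen rows; frozen rows stay untouched."""
    for i in unfrozen:
        weights[i] = [w - lr * g for w, g in zip(weights[i], grads[i])]
    return weights


# Toy 4x3 weight matrix: rows 1 and 3 have the largest average magnitude.
weights = [
    [0.1, -0.2, 0.1],
    [2.0, -1.5, 1.0],
    [0.05, 0.0, -0.05],
    [-1.0, 0.8, 0.9],
]
grads = [[1.0, 1.0, 1.0] for _ in weights]

unfrozen = select_unfrozen_rows(weights, fraction=0.5)
updated = sgd_step_unfrozen([row[:] for row in weights], grads, unfrozen, lr=0.5)
print(unfrozen)
```

In this toy setting only the two high-magnitude rows change; the remaining rows keep their values, which is what makes it possible to skip their gradient computation.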

Paper Structure

This paper contains 33 sections, 8 equations, 4 figures, 7 tables, 1 algorithm.

Figures (4)

  • Figure 1: Main backward matrix multiplications of the quantized layer with symmetric weight quantization (using scale $S_w$) and asymmetric input quantization (using scale $S_x$ and zero point $Z_x$). Left: Quantization-aware training applies both matrix multiplications in full precision. Right: EfQAT accelerates the backward pass by performing the matrix multiplication only over the most important/unfrozen rows (the rows with large average magnitude).
  • Figure 2: Accuracy and performance of EfQAT-CWPN/LWPN on the ImageNet (using ResNet-50), SQuAD (using BERT$_\text{base}$), and CIFAR-10 (using ResNet-20) datasets.
  • Figure 3: The importance of different channels of convolutions in ResNet-20 (left) and the rows of the output matrix of the self-attention layers in BERT$_\text{base}$ (right). A significant number of outliers can be noted for both networks across different layers. In both plots, the last column displays the channel/row importance for the whole network.
  • Figure 4: The effect of different freezing intervals on the accuracy of EfQAT-CWPN with W8A8. We update the frozen channels every $f$ samples. Larger $f$ does not cause a large accuracy drop during the EfQAT training epoch.
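
The row-restricted backward pass in the Figure 1 caption can be sketched as follows. The sketch assumes a linear layer $Y = X W^\top$, so the weight gradient is $\nabla W = (\nabla Y)^\top X$ and each row of $\nabla W$ corresponds to one output row of the layer. Quantization is left out to keep the example short, and the names `full_weight_grad` and `partial_weight_grad` are hypothetical, not taken from the paper.

```python
def full_weight_grad(grad_out, inputs):
    """dW = dY^T X for Y = X W^T; grad_out is (n, out), inputs is (n, in)."""
    n_out, n_in = len(grad_out[0]), len(inputs[0])
    return [
        [sum(g[i] * x[j] for g, x in zip(grad_out, inputs)) for j in range(n_in)]
        for i in range(n_out)
    ]


def partial_weight_grad(grad_out, inputs, unfrozen):
    """Compute only the rows of dW listed in `unfrozen`, keyed by row index."""
    n_in = len(inputs[0])
    return {
        i: [sum(g[i] * x[j] for g, x in zip(grad_out, inputs)) for j in range(n_in)]
        for i in unfrozen
    }


inputs = [[1.0, 2.0], [3.0, 4.0]]              # X: 2 samples, 2 input features
grad_out = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]  # dY: 2 samples, 3 output rows
unfrozen = [0, 2]                              # row 1 is treated as frozen

full = full_weight_grad(grad_out, inputs)
partial = partial_weight_grad(grad_out, inputs, unfrozen)
print(partial)
```

The partial result matches the corresponding rows of the full gradient while skipping the work for frozen rows. Under the same assumptions, the freezing interval $f$ from Figure 4 would correspond to recomputing the `unfrozen` index list once every $f$ samples rather than at every step.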