Improving Quantization-aware Training of Low-Precision Network via Block Replacement on Full-Precision Counterpart

Chengting Yu, Shu Yang, Fengzhao Zhang, Hanzhi Ma, Aili Wang, Er-Ping Li

TL;DR

The paper tackles aggressive quantization in QAT by addressing gradient mismatch and limited representation at very low bit widths. It introduces the Block-wise Replacement Framework (BWRF), which grafts low-precision blocks into a fixed full-precision partner to form intermediate mixed-precision models that guide both the forward and backward passes throughout training. Training optimizes a joint objective that combines the task loss with distillation losses from the FP and MP branches, providing implicit regularization and improved gradient estimation. Empirically, BWRF delivers state-of-the-art results for 4-, 3-, and 2-bit quantization on ImageNet and CIFAR-10 under uniform quantization and remains compatible with existing QAT pipelines via a concise wrapper.

Abstract

Quantization-aware training (QAT) is a common paradigm for network quantization: the training phase simulates low-precision computation so that the quantization parameters are optimized in alignment with the task goals. However, direct training of low-precision networks generally faces two obstacles: (1) the low-precision model has limited representation capability and cannot directly replicate full-precision calculations, which puts it at a disadvantage relative to its full-precision counterpart; (2) using pseudo-gradients to approximate the derivatives of quantization functions commonly introduces non-ideal deviations during gradient propagation. In this paper, we propose a general QAT framework that alleviates these concerns by letting the full-precision partner guide the forward and backward processes of the low-precision network during training. Alongside the direct training of the quantized model, intermediate mixed-precision models are generated through block-by-block replacement on the full-precision model and work simultaneously with the low-precision backbone, which integrates quantized low-precision blocks into full-precision networks throughout the training phase. Consequently, each quantized block can (1) simulate full-precision representations during forward passes and (2) obtain better-estimated gradients during backward passes. We demonstrate that the proposed method achieves state-of-the-art results for 4-, 3-, and 2-bit quantization on ImageNet and CIFAR-10. The proposed framework is a compatible extension for most QAT methods and requires only a concise wrapper around existing code.
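
The block-by-block replacement described above can be illustrated with a toy sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the scalar "blocks", the symmetric uniform quantizer, the layout in which the mixed-precision model with index `k` runs its first `k` blocks in low precision and the remaining blocks in full precision, and the form of the joint loss (cross-entropy plus a KL distillation term against the full-precision output, summed over all branches). The names `fake_quantize`, `make_block`, `mixed_forward`, `joint_loss`, and `kd_weight` are hypothetical.

```python
import math

def fake_quantize(w, bits, w_max=1.0):
    """Symmetric uniform fake quantization of a scalar weight (toy quantizer)."""
    levels = 2 ** (bits - 1) - 1          # positive grid points, e.g. 7 for 4-bit
    step = w_max / levels
    clipped = max(-w_max, min(w_max, w))  # clip to the representable range
    return round(clipped / step) * step   # snap to the nearest grid point

def make_block(w, b, bits=None):
    """Toy block x -> relu(w * x + b); the weight is quantized if `bits` is given."""
    if bits is not None:
        w = fake_quantize(w, bits)
    return lambda x: [max(0.0, w * v + b) for v in x]

def mixed_forward(x, lp_blocks, fp_blocks, k):
    """Assumed layout: blocks 0..k-1 are low precision, the rest full precision.
    k = 0 gives the full-precision model, k = len(fp_blocks) the low-precision one."""
    for i in range(len(fp_blocks)):
        x = (lp_blocks[i] if i < k else fp_blocks[i])(x)
    return x

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(logits, label):
    return -math.log(softmax(logits)[label])

def kl_div(teacher_logits, student_logits):
    p, q = softmax(teacher_logits), softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def joint_loss(x, label, lp_blocks, fp_blocks, kd_weight=1.0):
    """Assumed objective: task loss plus distillation from the FP output,
    summed over every mixed-precision branch and the low-precision backbone."""
    n = len(fp_blocks)
    teacher = mixed_forward(x, lp_blocks, fp_blocks, 0)      # pure FP output
    total = 0.0
    for k in range(1, n + 1):                                # k = n is the LP backbone
        student = mixed_forward(x, lp_blocks, fp_blocks, k)
        total += cross_entropy(student, label) + kd_weight * kl_div(teacher, student)
    return total

# Three toy blocks sharing parameters between the FP and 2-bit LP versions.
params = [(0.83, 0.1), (0.42, 0.3), (0.91, -0.2)]
fp_blocks = [make_block(w, b) for w, b in params]
lp_blocks = [make_block(w, b, bits=2) for w, b in params]
x, label = [0.5, -1.0, 2.0], 2
loss = joint_loss(x, label, lp_blocks, fp_blocks)
```

In this sketch the low-precision prefix is recomputed for every branch to keep the code short. A practical version would run the low-precision forward pass once and feed each intermediate feature into the remaining full-precision blocks, and it would need an autograd framework with a pseudo-gradient for the rounding step, which plain Python does not provide.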

Paper Structure

This paper contains 12 sections, 11 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Framework Overview. (a) The fundamental implementation of Quantization-aware Training (QAT) in which weight initialization is performed using full-precision counterparts. (b) The proposed block-wise replacement framework (BWRF) generates mixed-precision models during the training phase, employing full-precision counterparts for auxiliary supervision.
  • Figure 2: Implementation of BWRF Training. Mixed-precision models are implemented implicitly through the utilization of overlapping LP forward flows. Both task targets and model predictions are regarded as loss sources for training.
  • Figure 3: Validation accuracy of mixed-precision models and low-precision backbone during training. The results are obtained by ResNet-18 on ImageNet.
  • Figure 4: Measures of feature similarity. The results of cosine distances are obtained by ResNet-18 on CIFAR-10 under 4-bit quantization.
  • Figure 5: Visualization results of class activation mapping (CAM). Two visualized targets are established on the third and fourth blocks' output layers. (a-b) The results of the full-precision model $F$ and low-precision model $Q$ trained with BWRF. (c) The results of the vanilla model with baseline implementation. (d-f) The results of mixed-precision models $M^1$, $M^2$, $M^3$, respectively.
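
Figure 4 reports cosine distances between features. A minimal sketch of such a measure follows, assuming that cosine distance means one minus cosine similarity over flattened feature vectors. The paper's exact definition and the feature pairs it compares are not given in this summary, and `cosine_distance` is an illustrative name.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(a, b) for two equal-length, non-zero feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical flattened features from a full-precision and a quantized block.
fp_feature = [0.27, 0.07, 0.75]
lp_feature = [0.25, 0.00, 0.75]
distance = cosine_distance(fp_feature, lp_feature)
```

Under this definition, 0 means the two features point in the same direction, 1 means they are orthogonal, and 2 means they point in opposite directions.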