Scheduling Weight Transitions for Quantization-Aware Training

Junghyup Lee; Jeimin Jeon; Dohyung Kim; Bumsub Ham

TL;DR

This work addresses a limitation of conventional learning-rate (LR) scheduling in quantization-aware training (QAT): the LR gives poor control over how much the quantized weights change, because a quantized weight moves to a different discrete level only when its latent weight crosses a transition point of the quantizer. The paper introduces a transition-rate (TR) scheduling framework that directly targets the fraction of weights changing quantization levels, paired with a transition-adaptive learning rate (TALR) that steers latent-weight updates toward the target TR, enabling stable, coarse-to-fine QAT optimization. The TR scheduler applies to both binary and multi-bit quantization and is compatible with multiple optimizers (SGD, Adam, AdamW, etc.). It yields consistent accuracy gains on image classification and object detection benchmarks (e.g., ImageNet, COCO) and reduces the oscillations observed in QAT. Results are state-of-the-art or competitive across architectures (e.g., ResNet, MobileNetV2, ReActNet, DeiT) and bit-widths, with modest computational overhead.
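
As a rough illustration of the quantity being scheduled, the sketch below measures the fraction of weights whose quantized value changes after one update. It assumes a plain sign quantizer with its transition point at zero; the names `binarize` and `transition_rate` are hypothetical and are not taken from the paper.

```python
def binarize(w):
    """Sign quantizer: maps a latent weight to -1 or +1 (transition point at 0)."""
    return 1 if w >= 0.0 else -1


def transition_rate(latent_before, latent_after, quantize=binarize):
    """Fraction of weights whose quantized value differs before and after an update."""
    if len(latent_before) != len(latent_after):
        raise ValueError("weight lists must have the same length")
    if not latent_before:
        return 0.0
    changed = sum(
        quantize(b) != quantize(a) for b, a in zip(latent_before, latent_after)
    )
    return changed / len(latent_before)


# Four latent weights before and after a small update.
before = [0.30, -0.02, 0.01, -0.50]
after = [0.25, 0.03, -0.04, -0.45]

# Only the two weights that started close to zero cross the transition point.
tr = transition_rate(before, after)
print(f"transition rate: {tr}")
```

A different quantizer (for example a multi-bit one) could be passed through the `quantize` argument, since the measurement only compares discrete levels before and after the update.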

Abstract

Quantization-aware training (QAT) simulates a quantization process during training to lower the bit-precision of weights and activations. It learns quantized weights indirectly by updating latent weights, i.e., the full-precision inputs to a quantizer, using gradient-based optimizers. We claim that coupling a user-defined learning rate (LR) with these optimizers is sub-optimal for QAT. A quantized weight transits between discrete levels of a quantizer only if the corresponding latent weight passes a transition point, where the quantizer changes its discrete state. This suggests that changes in the quantized weights are affected by both the LR for the latent weights and the distribution of the latent weights. It is thus difficult to control the degree of change in the quantized weights by scheduling the LR manually. We conjecture that the degree of parameter change in QAT is related to the number of quantized weights transiting between discrete levels. Based on this, we introduce a transition rate (TR) scheduling technique that explicitly controls the number of transitions of quantized weights. Instead of scheduling an LR for latent weights, we schedule a target TR of quantized weights and update the latent weights with a novel transition-adaptive LR (TALR), which accounts for the degree of change in the quantized weights during QAT. Experimental results demonstrate the effectiveness of our approach on standard benchmarks.
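
The dependence on the latent-weight distribution can be seen in a small simulation. The sketch below is only a toy model under stated assumptions: latent weights are drawn from a zero-mean Gaussian, the quantizer is a sign function, and the "gradient" is unit-variance random noise rather than a real loss gradient. With the same LR, weights concentrated near the transition point flip far more often than widely spread weights.

```python
import random


def binarize(w):
    """Sign quantizer with a single transition point at zero."""
    return 1 if w >= 0.0 else -1


def simulated_transition_rate(weight_std, lr, num_weights=10000, seed=0):
    """Fraction of sign flips after one update of size lr (toy model)."""
    rng = random.Random(seed)
    flips = 0
    for _ in range(num_weights):
        w = rng.gauss(0.0, weight_std)  # latent weight
        g = rng.gauss(0.0, 1.0)         # stand-in gradient (assumption)
        if binarize(w) != binarize(w - lr * g):
            flips += 1
    return flips / num_weights


lr = 0.01  # identical LR in both settings
tr_near = simulated_transition_rate(weight_std=0.02, lr=lr)  # weights close to zero
tr_far = simulated_transition_rate(weight_std=1.0, lr=lr)    # weights spread out

print(f"TR with weights near the transition point: {tr_near:.4f}")
print(f"TR with weights far from the transition point: {tr_far:.4f}")
```

Because the LR is held fixed here, the gap between the two rates comes entirely from the distribution, which is why an LR schedule alone does not pin down how many quantized weights change.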

Paper Structure (45 sections, 19 equations, 10 figures, 12 tables, 1 algorithm)

Introduction
Related Work
QAT.
Optimization Methods.
Preliminary
Quantizer.
Optimizer.
Method
Empirical analysis
TR scheduler
TR of quantized weights.
Relation between an effective step size and a transition.
TR scheduler.
Quantization scheme
Multi-bit quantization.
...and 30 more sections

Figures (10)

Figure 1: Training curves of full-precision (FP) and quantized models for ResNet-20 \cite{he2016deep} on CIFAR-100 \cite{krizhevsky2009learning}. Both weights (W) and activations (A) are quantized to 2-bit precision (W2A2). With a gradient-based optimizer (SGD), we can roughly control the average effective step size of FP weights by scheduling an LR (\ref{fig:teaser_lr} vs. \ref{fig:teaser_fp}), but not that of quantized weights (the blue curve in \ref{fig:teaser_quant}). The curve for quantized weights is noisy and decreases rapidly at the end of training, suggesting that 1) the quantized weights can change significantly with a small LR and/or a small change in the LR, disturbing a coarse-to-fine parameter update and causing unstable training, and 2) adopting a manually scheduled LR for QAT is sub-optimal. The optimizer coupled with our scheduling technique (SGDT) can control the average effective step size of quantized weights by adjusting the number of transitions explicitly (the red curve in \ref{fig:teaser_quant}), showing better results in terms of accuracy and convergence (the red curve in \ref{fig:teaser_acc}).
Figure 2: Empirical analysis on QAT using SGD with a step LR decay. We binarize both weights and activations of ResNet-20 \cite{he2016deep} and train the model on CIFAR-100 \cite{krizhevsky2009learning}. For the visualizations in \ref{fig:empirical_update} and \ref{fig:empirical_distribution}, we track the latent and quantized weights in the 16th layer. We can see that the average effective step size of latent weights (the blue curve in \ref{fig:empirical_update}) is controlled by the LR in \ref{fig:empirical_lr}, while that for the quantized weights changes significantly even with a small LR (the red curve in \ref{fig:empirical_update}). This is because the change of quantized weights is also affected by the distribution of latent weights approaching the transition point (i.e., zero in \ref{fig:empirical_distribution}). The large changes in the quantized weights at the end of training (the red curve in \ref{fig:empirical_update}) degrade the performance in \ref{fig:empirical_acc}. (Best viewed in color.)
Figure 3: Analysis on TR scheduling. We train ResNet-20 \cite{he2016deep} on CIFAR-100 \cite{krizhevsky2009learning} using SGDT, where we quantize both weights and activations with 2-bit representations. We visualize distributions of normalized latent weights in the 16th layer in \ref{fig:discussion_distribution}, and average distances between normalized latent weights and the nearest transition points in \ref{fig:discussion_MD2TP}. The transition points in \ref{fig:discussion_distribution} are denoted by TPs on the x-axis. The top-1 test accuracy and average effective step sizes of quantized weights are shown by the red curves in Figs. \ref{fig:teaser_acc} and \ref{fig:teaser_quant}, respectively.
Figure S1: Training curves for quantized models using Adam \cite{kingma2014adam} and AdamT on ImageNet \cite{deng2009imagenet}. The results in the first and second rows are obtained with MobileNetV2 \cite{sandler2018mobilenetv2} and ReActNet-18 \cite{liu2020reactnet} using 4-bit and binary weights/activations, respectively. For the visualizations in \ref{fig:appendix_training_curves_stepsize}, we monitor the average effective step sizes of quantized weights in the 19th and 17th layers of MobileNetV2 and ReActNet-18, respectively. (Best viewed in color.)
Figure S2: Qualitative comparison for object detection on MS COCO \cite{lin2014microsoft} using RetinaNet \cite{lin2017focal} with the ResNet-50 \cite{he2016deep} backbone, where both weights and activations are quantized to 4 bits. The baseline results and ours are obtained from models trained with SGD and SGDT, respectively. (Best viewed in color.)
...and 5 more figures
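
The Figure 1 caption describes an optimizer that controls the changes in quantized weights by adjusting the number of transitions explicitly. The sketch below shows one way such closed-loop control could look. It is a hypothetical feedback rule in the spirit of a transition-adaptive LR, not the update rule from the paper, which this summary does not give: the step size `lr` is nudged up when a running average of the measured TR is below the scheduled target and down when it is above. The gain `eta`, the `momentum` value, the linear target schedule, and the random stand-in for the optimizer direction are all assumptions.

```python
import random


def binarize(w):
    """Sign quantizer with a single transition point at zero."""
    return 1 if w >= 0.0 else -1


def run_tr_controlled_updates(targets, num_weights=2000, eta=0.01,
                              momentum=0.9, seed=0):
    """Apply random latent updates whose step size tracks a target TR.

    Hypothetical controller (not taken from the paper): lr is raised when
    the running TR is below the target, lowered when above, and clamped at 0.
    """
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 0.1) for _ in range(num_weights)]
    lr = 0.0          # adaptive step size, starts at zero
    running_tr = 0.0  # exponential moving average of the measured TR
    measured = []
    for target in targets:
        # Stand-in for an optimizer direction: unit-variance random noise.
        updated = [w - lr * rng.gauss(0.0, 1.0) for w in latent]
        flips = sum(binarize(a) != binarize(b) for a, b in zip(latent, updated))
        tr = flips / num_weights
        running_tr = momentum * running_tr + (1.0 - momentum) * tr
        lr = max(0.0, lr + eta * (target - running_tr))
        latent = updated
        measured.append(tr)
    return measured


num_steps = 300
# Target TR decays linearly from 2% toward 0 (a coarse-to-fine schedule).
targets = [0.02 * (1.0 - step / num_steps) for step in range(num_steps)]
measured = run_tr_controlled_updates(targets)

early = sum(measured[50:100]) / 50
late = sum(measured[250:300]) / 50
print(f"mean TR, steps 50-99: {early:.4f}; steps 250-299: {late:.4f}")
```

In this toy setting the schedule is expressed on the TR itself, so the measured rate of quantized-weight changes falls as the target decays, regardless of how the latent weights happen to be distributed around the transition point.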
