How to Parameterize Asymmetric Quantization Ranges for Quantization-Aware Training

Jaeseong You, Minseop Park, Kyunggeun Lee, Seokjun An, Chirag Patel, Markus Nage

TL;DR

The paper analyzes three asymmetric QAT parameterizations—scale/offset, min/max, and beta/gamma—through controlled toy experiments and real-world large language model quantization to understand how learnable ranges respond to bit width and learning rate. It finds that scale/offset is prone to instability and poor convergence, especially under challenging bit widths, whereas min/max demonstrates greater robustness but slower convergence; beta/gamma enables fast, per-channel range learning and, when used without a sigmoid, often yields faster training and lower loss. The authors conclude with practical guidance: use min/max with appropriately scaled learning rates, and employ sigmoid-free beta/gamma for robust, rapid QAT, while noting beta/gamma can dynamically adapt to true min/max values and offer per-channel flexibility. They also discuss potential future directions, including distribution-aware and mixed-parameterization QAT to further optimize quantization for diverse weights and activations in large models.

Abstract

This paper investigates three different parameterizations of asymmetric uniform quantization for quantization-aware training: (1) scale and offset, (2) minimum and maximum, and (3) beta and gamma. We perform a comprehensive comparative analysis of these parameterizations' influence on quantization-aware training, using both controlled experiments and real-world large language models. Our particular focus is on their changing behavior in response to critical training hyperparameters, bit width and learning rate. Based on our investigation, we propose best practices to stabilize and accelerate quantization-aware training with learnable asymmetric quantization ranges.
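To make the three parameterizations concrete, here is a minimal NumPy sketch of how each could map to the same asymmetric quantize-dequantize operation. This summary does not give the paper's formulas, so everything below is an assumption for illustration: the sign convention for the offset, the function names, and in particular the reading of beta/gamma as multipliers applied to fixed initial min/max values (optionally passed through a sigmoid).

```python
import numpy as np


def fake_quantize(x, scale, offset, bits):
    """Asymmetric uniform quantize-dequantize on the integer grid [0, 2^bits - 1].

    Assumed convention: q = clip(round(x / scale) + offset, 0, 2^bits - 1).
    """
    qmax = 2 ** bits - 1
    q = np.clip(np.round(x / scale) + offset, 0, qmax)
    return (q - offset) * scale


def from_min_max(x_min, x_max, bits):
    """Derive (scale, offset) from a min/max range (hypothetical helper)."""
    scale = (x_max - x_min) / (2 ** bits - 1)
    offset = -np.round(x_min / scale)  # so that x_min lands on grid point 0
    return scale, offset


def from_beta_gamma(beta, gamma, init_min, init_max, bits, use_sigmoid=False):
    """Derive (scale, offset) from beta/gamma (hypothetical helper).

    Assumption: beta and gamma rescale fixed initial min and max values;
    with use_sigmoid=True they are first squashed into (0, 1).
    """
    if use_sigmoid:
        beta = 1.0 / (1.0 + np.exp(-beta))
        gamma = 1.0 / (1.0 + np.exp(-gamma))
    return from_min_max(beta * init_min, gamma * init_max, bits)


x = np.array([-1.0, 0.0, 0.5, 2.0])

# min/max: the range [-1.0, 2.5] at 3 bits gives a step size of 0.5.
scale_mm, offset_mm = from_min_max(-1.0, 2.5, bits=3)
x_mm = fake_quantize(x, scale_mm, offset_mm, bits=3)

# beta/gamma without sigmoid: beta = gamma = 1 reproduces the same range.
scale_bg, offset_bg = from_beta_gamma(1.0, 1.0, -1.0, 2.5, bits=3)
x_bg = fake_quantize(x, scale_bg, offset_bg, bits=3)

# beta/gamma with sigmoid: beta = gamma = 0 halves the range, so 2.0 is clipped.
scale_sg, offset_sg = from_beta_gamma(0.0, 0.0, -1.0, 2.5, bits=3, use_sigmoid=True)
x_sg = fake_quantize(x, scale_sg, offset_sg, bits=3)
```

In a QAT setting the quantities treated as learnable would differ per parameterization: `scale` and `offset` directly, `x_min` and `x_max`, or `beta` and `gamma`. The forward computation is the same in all three cases; only the variables that receive gradients change, which is where the sensitivity to bit width and learning rate described in the abstract would enter.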

Paper Structure (10 sections, 10 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Computational graph of asymmetric quantization.
  • Figure 2: Computational graph of symmetric quantization.
  • Figure 3: Learnable ranges of scale/offset and min/max (x-axis) changing over 5k steps of QAT (y-axis). scale/offset is shown in red and min/max in blue; lighter shades correspond to a learning rate of 1e-2 and darker shades to 1e-3. The left subfigure shows 3-bit quantization and the right shows 10-bit. Although we experimented with 16-bit quantization as well, scale/offset produced values too large to visualize effectively.
  • Figure 4: Cross-entropy loss of GPT2-small QAT (y-axis) over 2k training steps (x-axis). Left depicts QAT based on min/max and scale/offset. Right depicts QAT based on min/max and beta/gamma (with and without sigmoid).
  • Figure 5: Learnable ranges of min/max and beta/gamma changing over the course of QAT. beta/gamma is color-coded in green (min/max in blue). min/max+ and sigmoid-applied beta/gamma are depicted with dashed lines. The other details of the experiment are identical to those in Figure 3, except that we have omitted the case of a 1e-2 learning rate for visual clarity.
  • ...and 4 more figures