Toward INT4 Fixed-Point Training via Exploring Quantization Error for Gradients

Dohyung Kim, Junghyup Lee, Jeimin Jeon, Jaehyeon Moon, Bumsub Ham

TL;DR

This work addresses training neural networks with low-bit fixed-point gradients by analyzing how gradient quantization error affects learning. It shows that minimizing the error for large-magnitude gradients, rather than for the entire gradient distribution, improves training stability and accuracy, and it derives an upper bound on the quantization error for large gradients (ULG) that guides an adaptive update of the clipping interval. The authors use a layer-wise uniform quantizer with clipping value c_g = γ g_max, where γ is the clipping factor, together with an efficient update rule for γ that preserves large-gradient fidelity. They report state-of-the-art results on image classification, object detection, and super-resolution in the 4/4/4-bit and 5/5/5-bit settings with negligible overhead. The approach is hardware-friendly and broadly applicable, and it narrows the gap to full-precision performance while keeping training in fixed-point arithmetic.
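The layer-wise uniform quantizer with clipping value c_g = γ g_max can be illustrated with a minimal NumPy sketch. The function name `quantize_gradients`, the symmetric signed grid, and round-to-nearest are assumptions made for illustration; the summary does not specify the rounding scheme or the grid layout, and g_max is taken here to be the largest gradient magnitude in the layer.

```python
import numpy as np

def quantize_gradients(g, gamma=1.0, bits=4):
    """Simulate a layer-wise uniform quantizer with clipping value c_g = gamma * g_max.

    Assumptions (not taken from the paper): a symmetric signed grid with
    2**(bits-1) - 1 positive levels and round-to-nearest. The result is
    returned in floating point ("fake quantization") for easy inspection.
    """
    g = np.asarray(g, dtype=np.float64)
    g_max = np.max(np.abs(g))
    if g_max == 0.0:
        return np.zeros_like(g)           # nothing to quantize
    c_g = gamma * g_max                   # clipping value
    n_levels = 2 ** (bits - 1) - 1        # 7 positive levels for 4 bits
    step = c_g / n_levels                 # quantization step size
    clipped = np.clip(g, -c_g, c_g)       # values beyond c_g saturate
    return np.round(clipped / step) * step

# Usage: gamma = 1.0 reproduces a min-max interval; gamma < 1 clips the tail.
grads = np.array([-1.0, -0.5, 0.0, 0.3, 1.0])
q_minmax = quantize_gradients(grads, gamma=1.0, bits=4)
q_clipped = quantize_gradients(grads, gamma=0.5, bits=4)
```

A smaller γ gives a finer step for the gradients inside the interval at the cost of saturating those outside it, which is the trade-off the adaptive update of γ is meant to balance.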

Abstract

Network quantization generally converts full-precision weights and/or activations into low-bit fixed-point values in order to accelerate inference. Recent approaches further discretize the gradients into low-bit fixed-point values, enabling efficient training. They typically set the quantization interval using the min-max range of the gradients, or adjust the interval so that the quantization error over all gradients is minimized. In this paper, we analyze the quantization error of gradients for low-bit fixed-point training, and show that lowering the error for large-magnitude gradients boosts quantization performance significantly. Based on this, we derive an upper bound on the quantization error for large gradients in terms of the quantization interval, and obtain an optimal condition for the interval that minimizes this error. We also introduce an interval update algorithm that adaptively adjusts the quantization interval to keep the quantization error for large gradients small. Experimental results demonstrate the effectiveness of our quantization method for various combinations of network architectures and bit-widths on various tasks, including image classification, object detection, and super-resolution.

Paper Structure

This paper contains 22 sections, 16 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Probability density function (PDF) of gradient magnitudes for a single layer. The clip-in (blue) and clip-out (red) gradients, $G_{in}$ and $G_{out}$, are subsets of large gradients $G_L$ (yellow), and $G_{in}$ and $G_{out}$ are within and beyond the clipping value $\gamma g_{max}$, respectively. See Sec. \ref{sec:method_interval} for more details. (Best viewed in color.)
  • Figure 2: Comparison of DSGC \cite{zhu2020towards} and our baseline ($\gamma=1.0$). (a) Clipping factor of DSGC; (b) The quantization error for entire gradients $E(G)$; (c) The quantization error for large gradients $E(G_{L})$; (d) Training loss. We visualize the quantization errors for entire gradients $E(G)$ and large gradients $E(G_{L})$, while tracking the clipping factor of DSGC in the 13th layer. Top-1 accuracies of DSGC and the baseline are 24.3 and 61.1, respectively, on the test split of CIFAR-100 \cite{krizhevsky2009learning}. (Best viewed in color.)
  • Figure 3: Empirical analysis of the quantization error for large gradients. (a-c) $E(G_{L})$ in the 13th, 15th, and 17th layers, respectively; (d) Training loss. Top-1 accuracies for clipping factors of 0.4, 0.6, and 0.8 are 30.3, 63.5, and 63.6, respectively, on the test split of CIFAR-100 \cite{krizhevsky2009learning}. (Best viewed in color.)
  • Figure 4: Comparison of ours with the baselines in terms of quantization error for gradients. (a-c) $E(G_{L})$ in 5th, 15th, 17th layers, respectively; (d) Training loss; (e-g) $E(G)$ in 5th, 15th, 17th layers, respectively; (h) Clipping factors. (Best viewed in color.)
  • Figure 5: Comparison of latencies for forward and backward passes on a TITAN RTX using CIFAR-100 \cite{krizhevsky2009learning}. We normalize the forward latency of the baseline to 1.
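
The partition described in the Figure 1 caption, and the two error curves $E(G)$ and $E(G_{L})$ plotted in Figures 2 to 4, can be explored with a small NumPy sketch. Several pieces are assumptions rather than the paper's definitions: the threshold `large_ratio` used to decide which gradients count as large, the relative L1 error used as a stand-in for $E(\cdot)$, the round-to-nearest quantizer, and the synthetic Laplace-distributed gradients.

```python
import numpy as np

def uniform_quantize(g, gamma, bits=4):
    """Uniform quantizer with clipping value gamma * g_max (round-to-nearest assumed)."""
    g_max = np.max(np.abs(g))
    if g_max == 0.0:
        return np.zeros_like(g)
    c_g = gamma * g_max
    step = c_g / (2 ** (bits - 1) - 1)
    return np.round(np.clip(g, -c_g, c_g) / step) * step

def split_large_gradients(g, gamma, large_ratio=0.3):
    """Return boolean masks for G_L, G_in, and G_out.

    `large_ratio` is a hypothetical threshold: gradients whose magnitude is
    at least large_ratio * g_max are treated as large. The paper's own
    definition of G_L is not given in this summary.
    """
    mag = np.abs(g)
    g_max = mag.max()
    large = mag >= large_ratio * g_max          # G_L
    clip_in = large & (mag <= gamma * g_max)    # G_in: large, inside the interval
    clip_out = large & (mag > gamma * g_max)    # G_out: large, beyond the interval
    return large, clip_in, clip_out

def relative_error(g, q, mask):
    """Relative L1 error over the selected gradients (an assumed stand-in for E)."""
    if not mask.any():
        return 0.0
    return float(np.sum(np.abs(g[mask] - q[mask])) / np.sum(np.abs(g[mask])))

# Usage on synthetic heavy-tailed gradients (illustrative data only).
rng = np.random.default_rng(0)
grads = rng.laplace(scale=1e-3, size=100_000)
everything = np.ones(grads.shape, dtype=bool)
for gamma in (0.4, 0.6, 0.8, 1.0):
    q = uniform_quantize(grads, gamma)
    large, clip_in, clip_out = split_large_gradients(grads, gamma)
    print(gamma, relative_error(grads, q, everything), relative_error(grads, q, large))
```

Printing both errors side by side for several values of γ makes it easy to check, on a given gradient tensor, whether the interval that minimizes the error over all gradients also keeps the error for large gradients small.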