GDNSQ: Gradual Differentiable Noise Scale Quantization for Low-bit Neural Networks

Sergey Salishev, Ian Akhremchik

TL;DR

This work reframes neural quantization as a chain of noisy channels and introduces GDNSQ, a gradual differentiable noise-scale quantization method that learns bit-width, noise scale, and clamps under an exterior-point penalty. It combines a differentiable STE with DoReFa-style dithering, Jeffreys-divergence distillation, and a constrained optimization objective to reach target bit-widths while preserving FP-model performance. Empirically, GDNSQ achieves competitive accuracy on CIFAR-10/100 and ImageNet across W1A1–W4A4, with ablations highlighting the importance of gradual bit-width scheduling, distillation, and LR annealing. The approach is hardware-friendly due to its reliance on uniform quantization and suggests avenues for lossless quantization relative to the FP model, as well as architecture-aware quantization strategies.
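The Jeffreys divergence mentioned above is the symmetrized KL divergence, $J(p,q)=\mathrm{KL}(p\|q)+\mathrm{KL}(q\|p)=\sum_i (p_i-q_i)(\log p_i-\log q_i)$. The following is a minimal sketch of that quantity used as a distillation loss between a teacher (FP) and a student (quantized) output distribution. The function names, the `eps` smoothing term, and the example logits are illustrative assumptions and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def jeffreys_divergence(p, q, eps=1e-12):
    """Symmetrized KL: KL(p||q) + KL(q||p) = sum_i (p_i - q_i)(log p_i - log q_i).

    eps is a hypothetical smoothing constant that guards against log(0).
    """
    return sum(
        (pi - qi) * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
    )


# Hypothetical teacher (FP) and student (quantized) logits for one sample.
teacher = softmax([2.0, 0.5, -1.0])
student = softmax([1.5, 0.7, -0.8])
distill_loss = jeffreys_divergence(teacher, student)
```

Unlike the one-sided KL divergence, this loss is symmetric in its arguments, so it penalizes mismatch in both directions between the two distributions.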

Abstract

Quantized neural networks can be viewed as a chain of noisy channels, where rounding in each layer reduces capacity as bit-width shrinks; the floating-point (FP) checkpoint sets the maximum input rate. We track capacity dynamics as the average bit-width decreases and identify resulting quantization bottlenecks by casting fine-tuning as a smooth, constrained optimization problem. Our approach employs a fully differentiable Straight-Through Estimator (STE) with learnable bit-width, noise scale and clamp bounds, and enforces a target bit-width via an exterior-point penalty; mild metric smoothing (via distillation) stabilizes training. Despite its simplicity, the method attains competitive accuracy down to the extreme W1A1 setting while retaining the efficiency of STE.
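As a rough illustration of two ingredients named in the abstract, the sketch below shows the forward pass of a clamped uniform quantizer and a quadratic exterior-point penalty on the average bit-width. Both are textbook forms chosen for illustration: the paper's exact parameterization, its penalty shape, and the STE backward pass are not reproduced here, and the names `uniform_quantize`, `exterior_penalty`, and `weight` are hypothetical. The bit-width is also held to a fixed integer for simplicity, whereas the method learns it.

```python
def uniform_quantize(x, lo, hi, bits):
    """Clamp x to [lo, hi], then round to one of 2**bits evenly spaced levels.

    Forward pass only; a straight-through estimator would supply the gradient.
    """
    intervals = 2 ** bits - 1          # number of steps between lo and hi
    step = (hi - lo) / intervals
    x = min(max(x, lo), hi)            # clamp to the representable range
    return lo + round((x - lo) / step) * step


def exterior_penalty(layer_bits, target_bits, weight=1.0):
    """Quadratic exterior-point penalty on the mean bit-width (assumed form).

    Zero while the mean is at or below the target, growing quadratically
    once the constraint is violated.
    """
    mean_bits = sum(layer_bits) / len(layer_bits)
    violation = max(0.0, mean_bits - target_bits)
    return weight * violation ** 2


# With 1 bit and range [-1, 1] there are only two levels: -1 and +1.
binary_pos = uniform_quantize(0.3, -1.0, 1.0, bits=1)
binary_neg = uniform_quantize(-0.3, -1.0, 1.0, bits=1)

# Four layers at 4 bits against a 2-bit target violate the constraint by 2 bits.
penalty = exterior_penalty([4, 4, 4, 4], target_bits=2.0, weight=0.5)
```

An exterior-point penalty lets the optimizer start from an infeasible point (here, a network whose average bit-width is above the target) and pushes it toward the feasible region, which matches the gradual reduction of bit-width described above.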

Paper Structure

This paper contains 37 sections, 1 theorem, 51 equations, 6 figures, and 7 tables.

Key Result

Lemma A.1

Let $l,u\in\mathbb{Z}$ with $l<u$, and let $x\sim\mathrm{Unif}[l,u)$. For any $\Delta\in(0,\tfrac{1}{2})$,

Figures (6)

  • Figure 1: ResNet20 CIFAR10 W1A1 activations convergence
  • Figure 2: ResNet20 CIFAR10 W1A1 weights convergence
  • Figure 3: ResNet20 CIFAR10 training loss (eq. \ref{eq:loss})
  • Figure 4: ResNet20 CIFAR10 distillation loss $d$
  • Figure 5: ResNet20 CIFAR10 W1A1 bit-width evolution
  • ...and 1 more figure

Theorems & Definitions (2)

  • Lemma A.1
  • proof