A2Q+: Improving Accumulator-Aware Weight Quantization

Ian Colbert, Alessandro Pappalardo, Jakoba Petri-Koenig, Yaman Umuroglu

TL;DR

This work addresses hardware-efficient neural inference under very low-precision accumulation by extending accumulator-aware quantization (A2Q) with A2Q+. The authors identify that A2Q's $\ell_1$-norm bound and initialization are overly restrictive, introducing quantization error that worsens as the accumulator width decreases. They propose A2Q+ with a zero-centered bound $\lVert\bm{q}\rVert_{1} \le \frac{2^P - 2}{2^N - 1}$ and Euclidean-projection initialization to minimize initialization error, both compatible with weight normalization. Across CIFAR-10, BSD300, and ImageNet, A2Q+ yields Pareto-dominant trade-offs, enabling ResNet50 to reach roughly 95% of 32-bit accuracy at 12-bit accumulators and outperforming A2Q by about 17 percentage points in some settings, while preserving robustness against overflow. The work highlights practical implications for hardware-aware quantization, including depthwise-convolution considerations and sparsity opportunities for accelerators, pointing to future refinements in deployment-aware optimization and structured sparsity.

Abstract

Quantization techniques commonly reduce the inference costs of neural networks by restricting the precision of weights and activations. Recent studies show that also reducing the precision of the accumulator can further improve hardware efficiency at the risk of numerical overflow, which introduces arithmetic errors that can degrade model accuracy. To avoid numerical overflow while maintaining accuracy, recent work proposed accumulator-aware quantization (A2Q), a quantization-aware training method that constrains model weights during training to safely use a target accumulator bit width during inference. Although this shows promise, we demonstrate that A2Q relies on an overly restrictive constraint and a sub-optimal weight initialization strategy that each introduce superfluous quantization error. To address these shortcomings, we introduce: (1) an improved bound that alleviates accumulator constraints without compromising overflow avoidance; and (2) a new strategy for initializing quantized weights from pre-trained floating-point checkpoints. We combine these contributions with weight normalization to introduce A2Q+. We support our analysis with experiments that show A2Q+ significantly improves the trade-off between accumulator bit width and model accuracy and characterize new trade-offs that arise as a consequence of accumulator constraints.

Paper Structure (21 sections, 5 theorems, 23 equations, 8 figures, 3 tables)

Key Result

Proposition 3.0

Let $\bm{x}$ be a $K$-dimensional vector of $N$-bit integers such that the value of the $i$-th element $x_i$ lies within the closed interval $[c,d]$ and $d - c = 2^N - 1$. Let $\bm{q}$ be a $K$-dimensional vector of signed integers centered at zero such that $\sum_i q_i = 0$. To guarantee overflow a
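
The setup of this proposition can be checked numerically against the bound $\lVert\bm{q}\rVert_{1} \le \frac{2^P - 2}{2^N - 1}$ quoted in the TL;DR. The following is a minimal sketch, not the authors' code: the function names are hypothetical, the accumulator is assumed to be a signed $P$-bit register holding magnitudes up to $2^{P-1} - 1$, and only the final dot-product value is checked, not intermediate partial sums.

```python
from fractions import Fraction


def a2q_plus_l1_budget(P, N):
    """l1-norm budget from the TL;DR bound: (2^P - 2) / (2^N - 1)."""
    return Fraction(2**P - 2, 2**N - 1)


def worst_case_dot(q, c, N):
    """Largest |x . q| over all x with every x_i in [c, c + 2^N - 1]."""
    d = c + 2**N - 1
    # Maximize: pair positive weights with d, negative weights with c.
    hi = sum(qi * (d if qi > 0 else c) for qi in q)
    # Minimize: pair positive weights with c, negative weights with d.
    lo = sum(qi * (c if qi > 0 else d) for qi in q)
    return max(abs(hi), abs(lo))


def fits_signed_accumulator(q, c, N, P):
    """True if the worst-case result fits the assumed signed P-bit range."""
    return worst_case_dot(q, c, N) <= 2**(P - 1) - 1


P, N = 16, 8
budget = a2q_plus_l1_budget(P, N)  # 65534/255, slightly below 257

q_within = [64, 64, -64, -64]  # sums to zero, l1-norm 256 <= budget
q_beyond = [129, -129]         # sums to zero, l1-norm 258 > budget

within_ok = fits_signed_accumulator(q_within, 0, N, P)     # unsigned inputs
within_ok_signed = fits_signed_accumulator(q_within, -128, N, P)
beyond_ok = fits_signed_accumulator(q_beyond, 0, N, P)
```

In this example the zero-centered vector inside the budget stays within range for both unsigned inputs ($c = 0$) and signed inputs ($c = -128$), while the vector just outside the budget does not. Because the weights sum to zero, shifting every input by the same offset $c$ leaves the dot product unchanged, which is why only the width $d - c = 2^N - 1$ of the input interval matters.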

Figures (8)

  • Figure 1: We visualize Eq. \ref{eq:ratio_new_to_old} for both signed (blue crosses) and unsigned (green circles) integers to show the relative increase in $\ell_1$-norm budget that our new bound (Eq. \ref{eq:new_bound_prop_1}) gives to $\bm{q}$ when compared to the standard A2Q bound (Eq. \ref{eq:a2q_v1}).
  • Figure 2: We visualize the trade-off between accumulator bit width and model accuracy using Pareto frontiers. We observe that A2Q+ (green triangles) dominates both A2Q (blue circles) and the baseline QAT (red stars) in all benchmarks.
  • Figure 3: We evaluate the trade-off between activation bit width $N$ and model accuracy under fixed accumulator constraints. We visualize the average and standard deviation in model accuracy measured over 3 experiments as $N$ is increased from $3$ to $8$ bits when targeting accumulator widths that range from $14$ to $20$ bits. The weights of all hidden layers are fixed to $4$ bits.
  • Figure 4: We evaluate the impact of zero-centering on depthwise convolutions as we reduce the target accumulator bit width. We visualize the maximum observed test top-1 accuracy when training a W4A4 MobileNetV1 model on CIFAR10. We show that using A2Q for all depthwise convolutions and A2Q+ for all other hidden layers (green triangles) outperforms uniformly applying A2Q (blue circles) or A2Q+ (red crosses) to all hidden layers.
  • Figure 5: We visualize the test cross entropy loss when training ResNet18, ResNet34, and ResNet50 to classify ImageNet images using 4-bit weights and activations (W4A4) and targeting 14-bit accumulation using A2Q. We observe that our Euclidean projection initialization (EP-init) helps improve convergence. Note that the respective test top-1 accuracies are detailed in Table \ref{tbl:imagenet}.
  • ...and 3 more figures
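
The Euclidean projection named in the Figure 5 caption (EP-init) can be illustrated with the standard sort-and-threshold projection onto an $\ell_1$ ball. This is a minimal sketch under assumptions: the function name and the example values are hypothetical, and this page does not state how the authors compute the projection or how they map its result onto their weight-normalization parameters.

```python
import math


def project_onto_l1_ball(v, radius):
    """Return the closest point to v (in Euclidean distance) whose
    l1-norm is at most `radius`, using sort-and-threshold."""
    if radius <= 0:
        raise ValueError("radius must be positive")
    # Already inside the ball: the projection is v itself.
    if sum(abs(x) for x in v) <= radius:
        return list(v)
    # Find the soft-threshold theta from magnitudes sorted in descending order.
    magnitudes = sorted((abs(x) for x in v), reverse=True)
    running_sum = 0.0
    theta = 0.0
    for k, m in enumerate(magnitudes, start=1):
        running_sum += m
        candidate = (running_sum - radius) / k
        if m > candidate:
            theta = candidate  # the last index passing the test sets theta
    # Shrink every magnitude by theta, clip at zero, and keep the sign.
    return [math.copysign(max(abs(x) - theta, 0.0), x) for x in v]


# Hypothetical floating-point weights and an l1 budget of 2.0.
weights = [3.0, 1.0, -2.0]
projected = project_onto_l1_ball(weights, 2.0)
projected_l1 = sum(abs(x) for x in projected)
```

For these example values the threshold is 1.5, so the projected vector has magnitudes 1.5, 0, and 0.5 and lands exactly on the boundary of the budget. The smallest weight is zeroed out, which shows how a tight $\ell_1$ budget can induce the sparsity mentioned in the TL;DR.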

Theorems & Definitions (8)

  • Proposition 3.0
  • Proposition 3.0
  • Proposition 1.0
  • proof
  • Lemma 1.1
  • proof
  • Proposition 1.1
  • proof