Improving Post Training Neural Quantization: Layer-wise Calibration and Integer Programming

Itay Hubara, Yury Nahshan, Yair Hanani, Ron Banner, Daniel Soudry

TL;DR

This work tackles post-training quantization below 8 bits by introducing AdaQuant, a layer-wise, calibration-based optimization that jointly tunes weights, biases, and quantization parameters using a small calibration set. It adds an integer-programming framework that optimally allocates per-layer bit-widths under a defined degradation constraint, and BatchNorm tuning that corrects distributional shifts introduced by quantization. The combination (AdaQuant, IP-based bit allocation, BN tuning, and bias tuning) yields state-of-the-art results on both vision and language models, even with minimal calibration data, and enables practical mixed-precision deployment without full re-training. The authors also provide light and advanced deployment pipelines and release their code.
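
The bit-allocation idea can be illustrated with a toy search. This is a minimal sketch, not the authors' integer program: it enumerates every assignment by brute force instead of calling an IP solver, it assumes per-layer degradations simply add up, and the function name `allocate_bits`, the layer sizes, and the degradation numbers are all hypothetical.

```python
from itertools import product

def allocate_bits(sizes, degradation, budget, choices=(4, 8)):
    """Pick one bit-width per layer, minimizing total model size (in bits)
    subject to the summed per-layer degradation staying within `budget`.

    Brute-force enumeration: only practical for a handful of layers.
    """
    best_bits, best_cost = None, float("inf")
    for bits in product(choices, repeat=len(sizes)):
        # Assumption: degradation is additive across layers.
        total_deg = sum(degradation[i][b] for i, b in enumerate(bits))
        if total_deg > budget:
            continue  # violates the degradation constraint
        cost = sum(n * b for n, b in zip(sizes, bits))
        if cost < best_cost:
            best_bits, best_cost = bits, cost
    return best_bits, best_cost

# Hypothetical three-layer model: parameter counts and measured
# degradation for each candidate bit-width.
sizes = [1000, 5000, 2000]
degradation = [
    {4: 0.95, 8: 0.10},  # sensitive layer: 4-bit alone exceeds most of the budget
    {4: 0.30, 8: 0.05},
    {4: 0.40, 8: 0.05},
]
bits, cost = allocate_bits(sizes, degradation, budget=1.0)
print(bits, cost)
```

In this toy instance the sensitive first layer stays at 8 bits while the other two drop to 4 bits. A real solver replaces the enumeration once the number of layers makes the search space too large.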

Abstract

Lately, post-training quantization methods have gained considerable attention, as they are simple to use and require only a small unlabeled calibration set. This small dataset cannot be used to fine-tune the model without significant over-fitting. Instead, these methods only use the calibration set to set the activations' dynamic ranges. However, such methods have always resulted in significant accuracy degradation when used below 8 bits (except on small datasets). Here we aim to break the 8-bit barrier. To this end, we minimize the quantization errors of each layer separately by optimizing its parameters over the calibration set. We empirically demonstrate that this approach is: (1) much less susceptible to over-fitting than the standard fine-tuning approaches, and can be used even on a very small calibration set; and (2) more powerful than previous methods, which only set the activations' dynamic ranges. Furthermore, we demonstrate how to optimally allocate the bit-widths for each layer, while constraining accuracy degradation or model compression, by proposing a novel integer programming formulation. Finally, we suggest model global statistics tuning to correct biases introduced during quantization. Together, these methods yield state-of-the-art results for both vision and text models. For instance, on ResNet50, we obtain less than 1% accuracy degradation with 4-bit weights and activations in all layers but the smallest two. We open-sourced our code.
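
To make "minimizing the quantization error of each layer separately" concrete, here is a minimal NumPy sketch under simplifying assumptions. It is not AdaQuant itself: it tunes only a single weight step size by grid search, whereas the paper optimizes the layer's parameters over the calibration set. The helper names (`quantize`, `layer_output_error`, `calibrate_scale`) and the random weights and calibration batch are hypothetical.

```python
import numpy as np

def quantize(w, scale, bits=4):
    """Symmetric uniform quantization: round to the integer grid, clip, rescale."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def layer_output_error(w, x, scale, bits=4):
    """Mean squared error between full-precision and quantized layer outputs."""
    return float(np.mean((x @ w.T - x @ quantize(w, scale, bits).T) ** 2))

def calibrate_scale(w, x, bits=4, num_candidates=100):
    """Grid-search the step size that minimizes output error on the calibration set."""
    qmax = 2 ** (bits - 1) - 1
    max_scale = np.abs(w).max() / qmax  # "min-max" scale that avoids clipping
    candidates = np.linspace(0.2 * max_scale, max_scale, num_candidates)
    errors = [layer_output_error(w, x, s, bits) for s in candidates]
    best = int(np.argmin(errors))
    return float(candidates[best]), errors[best]

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 32))  # hypothetical weights of one linear layer
x = rng.normal(size=(64, 32))  # hypothetical unlabeled calibration batch

scale, err = calibrate_scale(w, x)
minmax_err = layer_output_error(w, x, np.abs(w).max() / 7)  # baseline: range-only scale
print(f"calibrated error {err:.4f} vs. min-max error {minmax_err:.4f}")
```

The objective is measured on the layer's output rather than on the weights alone, so the calibration data decides which rounding errors matter. No labels are needed because the full-precision layer output serves as the target.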

Paper Structure

This paper contains 29 sections, 19 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Comparison of different optimization methods over ResNet-50 quantized to 4 bits, except the first and last layers, which were kept at 8 bits. Even optimizing on a single image drastically improves the results but, as expected, has a high variance (red bar). The variance decreases rapidly as the calibration set size increases.
  • Figure 2: AdaQuant vs. AdaRound. (a) A histogram of the $\Delta W$ distribution. AdaRound restricts this additive term to be $\Delta W=\pm1$. Relaxing this constraint provides a more powerful optimization. (b) Ablation study on parameter optimization for ResNet50 over ImageNet. AdaRound is based exclusively on weight optimization, while AdaQuant optimizes the weights, biases, and other quantization parameters jointly.
  • Figure 3: Ablation study over ResNet-50/18 and MobileNet-V2: compression-accuracy curves. Our advanced pipeline consists of AdaQuant, IP-mixed-precision, BN-tuning, and bias-tuning. Our light pipeline consists of only IP-mixed-precision and BN-tuning. The relaxed advanced pipeline, which appears in \ref{fig:resnet_abalation-study}, is similar to the advanced pipeline but allows the integer program to choose any bit-width between 2 and 8, not just 4-bit or 8-bit. The compression ratio is measured as the ratio between the size of the compressed model and that of the full-precision (32-bit) model; thus a compression ratio of 0.25 indicates that the entire model uses 8-bit precision, and a ratio of 0.125 indicates 4-bit precision.
  • Figure B.1:
  • Figure D.2: Calibration size ablation study with additional early-stop plot.
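
The compression ratios quoted in the Figure 3 caption follow from simple arithmetic. The sketch below assumes the ratio is weighted by per-layer parameter count, which the caption does not spell out; the function name `compression_ratio` and the layer sizes are hypothetical.

```python
def compression_ratio(param_counts, bit_widths, full_precision_bits=32):
    """Size of the quantized model divided by the size of the 32-bit model.

    Assumption: each layer contributes (parameter count x bit-width) bits.
    """
    compressed = sum(n * b for n, b in zip(param_counts, bit_widths))
    full = full_precision_bits * sum(param_counts)
    return compressed / full

layers = [100, 300]  # hypothetical per-layer parameter counts

print(compression_ratio(layers, [8, 8]))  # uniform 8-bit: 8/32
print(compression_ratio(layers, [4, 4]))  # uniform 4-bit: 4/32
print(compression_ratio(layers, [8, 4]))  # mixed precision falls in between
```

A mixed-precision model lands between 0.125 and 0.25 when its layers use only 4-bit and 8-bit precision, which is the range the advanced and light pipelines operate in.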