Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, Dmitry Kalenichenko

TL;DR

This work tackles the challenge of deploying CNNs on mobile devices by enabling integer-arithmetic-only inference through 8-bit quantization of weights and activations, together with a co-designed training procedure. It introduces an affine quantization scheme, efficient zero-point handling, and fused layer implementations, and pairs them with simulated quantization during training to preserve accuracy. The approach is validated on MobileNets, ResNets, and Inception-v3, showing competitive accuracy with substantial latency reductions on ARM CPUs for ImageNet and COCO tasks. The results point to real-time, on-device vision applications on widely available hardware.

Abstract

The rising popularity of intelligent mobile devices and the daunting computational cost of deep learning-based models call for efficient and accurate on-device inference schemes. We propose a quantization scheme that allows inference to be carried out using integer-only arithmetic, which can be implemented more efficiently than floating point inference on commonly available integer-only hardware. We also co-design a training procedure to preserve end-to-end model accuracy post quantization. As a result, the proposed quantization scheme improves the tradeoff between accuracy and on-device latency. The improvements are significant even on MobileNets, a model family known for run-time efficiency, and are demonstrated in ImageNet classification and COCO detection on popular CPUs.
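
As a rough illustration of the quantization scheme the abstract refers to, the sketch below maps a real value r to an 8-bit integer q through a scale S and zero-point Z, so that r ≈ S(q − Z). The function names (`choose_params`, `quantize`, `dequantize`) and the min/max range selection are assumptions made for this example, not the paper's reference implementation.

```python
def choose_params(rmin, rmax, num_bits=8):
    """Pick a scale S and zero-point Z for the real range [rmin, rmax]."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Widen the range to contain 0.0 so that real zero maps to an exact integer.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    if rmax == rmin:
        return 1.0, 0  # degenerate all-zero range (choice made for this sketch)
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = int(round(qmin - rmin / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(r, scale, zero_point, num_bits=8):
    """Map a real value to an unsigned integer, saturating at the range limits."""
    q = int(round(r / scale)) + zero_point
    return max(0, min(2 ** num_bits - 1, q))


def dequantize(q, scale, zero_point):
    """Recover the approximate real value: r = S * (q - Z)."""
    return scale * (q - zero_point)


if __name__ == "__main__":
    scale, zero_point = choose_params(-1.0, 1.0)
    for r in (-1.0, -0.5, 0.0, 0.3, 1.0):
        q = quantize(r, scale, zero_point)
        print(r, q, dequantize(q, scale, zero_point))
```

Because the zero-point is itself one of the representable integers, real 0.0 round-trips without error, which matters for operations such as zero-padding.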

Paper Structure

This paper contains 30 sections, 15 equations, 12 figures, 8 tables, 1 algorithm.

Figures (12)

  • Figure 1.1: Integer-arithmetic-only quantization. a) Integer-arithmetic-only inference of a convolution layer. The input and output are represented as 8-bit integers according to equation \ref{eq:quant-scheme}. The convolution involves 8-bit integer operands and a 32-bit integer accumulator. The bias addition involves only 32-bit integers (section \ref{sec:fused-layer}). The ReLU6 nonlinearity only involves 8-bit integer arithmetic. b) Training with simulated quantization of the convolution layer. All variables and computations are carried out using 32-bit floating-point arithmetic. Weight quantization ("wt quant") and activation quantization ("act quant") nodes are injected into the computation graph to simulate the effects of quantization of the variables (section \ref{sec:training}). The resultant graph approximates the integer-arithmetic-only computation graph in panel a), while being trainable using conventional optimization algorithms for floating-point models. c) Our quantization scheme benefits from the fast integer-arithmetic circuits in common CPUs to deliver an improved latency-vs-accuracy tradeoff (section \ref{sec:experiments}). The figure compares integer-quantized MobileNets \cite{MobilenetV1} against floating-point baselines on ImageNet \cite{deng2009imagenet} using Qualcomm Snapdragon 835 LITTLE cores.
  • Figure 4.1: ImageNet classifier on Qualcomm Snapdragon 835 big cores: Latency-vs-accuracy tradeoff of floating-point and integer-only MobileNets.
  • Figure 4.2: ImageNet classifier on Qualcomm Snapdragon 821: Latency-vs-accuracy tradeoff of floating-point and integer-only MobileNets.
  • Figure 4.3: Face attribute classifier on Qualcomm Snapdragon 821: Latency-vs-accuracy tradeoff of floating-point and integer-only MobileNets.
  • Figure C.1: Simple graph: original
  • ...and 7 more figures
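
The integer-only data path described in the caption of Figure 1.1 a) (8-bit operands, a 32-bit accumulator, a 32-bit bias, then a return to 8 bits) can be sketched for a single dot product as follows. This is a simplified illustration under stated assumptions: `quantized_dot` and its parameters are hypothetical names, Python integers stand in for fixed-width types, and the fixed-point rescaling is a plain multiply-and-shift rather than the paper's exact routine. The multiplier M = S_x · S_w / S_out is assumed to be below 1 and is stored as the integer `m_fixed = round(M * 2**shift)`.

```python
def quantized_dot(q_x, q_w, q_bias, z_x, z_w, z_out, m_fixed, shift=31):
    """Integer-only dot product with bias, rescaling, and saturation to uint8.

    q_x, q_w : lists of ints in [0, 255] (quantized inputs and weights)
    q_bias   : int, bias quantized with scale S_x * S_w and zero-point 0
    z_x, z_w, z_out : zero-points of input, weights, and output
    m_fixed  : int, round(M * 2**shift) with M = S_x * S_w / S_out (assumed < 1)
    """
    acc = 0  # plays the role of the 32-bit accumulator
    for qx, qw in zip(q_x, q_w):
        acc += (qx - z_x) * (qw - z_w)
    acc += q_bias
    # Multiply by M in fixed point; adding half the divisor rounds to nearest.
    scaled = (acc * m_fixed + (1 << (shift - 1))) >> shift
    # Saturate to the 8-bit output range; a ReLU-style clamp can be folded
    # into this step by choosing the output range accordingly (assumption).
    return max(0, min(255, scaled + z_out))


if __name__ == "__main__":
    # Illustrative values: S_x = 0.02, S_w = 0.01, S_out = 0.05, so M = 0.004.
    m_fixed = round(0.004 * 2 ** 31)
    q_out = quantized_dot([153, 113, 178], [140, 160, 130], 500,
                          z_x=128, z_w=120, z_out=0, m_fixed=m_fixed)
    print(q_out, 0.05 * (q_out - 0))  # quantized output and its real value
```

With these illustrative scales the quantized inputs correspond to x = [0.5, -0.3, 1.0], w = [0.2, 0.4, 0.1], and a bias of 0.1, so the dequantized output can be checked by hand against the real-valued result 0.18.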