Table of Contents
Fetching ...

Quantization Range Estimation for Convolutional Neural Networks

Bingtao Yang, Yujia Wang, Mengzhi Jiao, Hongwei Huo

TL;DR

This work tackles the challenge of maintaining high accuracy in post-training quantization (PTQ) at low bit-width by introducing REQuant, a per-layer range-estimation framework that optimizes a scale-controlled parameter $\alpha \in (0,1]$ to minimize quantization error. The authors prove local convexity of the objective and solve it efficiently with a golden-section search, and further boost performance by a weight reshaping transform that allocates quantization intervals nonuniformly across the weight distribution. Experiments on CIFAR-10/100 with ResNet families and Inception-v3 show state-of-the-art top-1 accuracy at 8-bit and 6-bit, with meaningful gains at 4-bit, outperforming several PTQ baselines. The approach is practical, requires no retraining, and is accompanied by a public code release, enabling broader adoption in resource-constrained deployments.

Abstract

Post-training quantization for reducing the storage of deep neural network models has been demonstrated to be an effective way in various tasks. However, low-bit quantization while maintaining model accuracy is a challenging problem. In this paper, we present a range estimation method to improve the quantization performance for post-training quantization. We model the range estimation into an optimization problem of minimizing quantization errors by layer-wise local minima. We prove this problem is locally convex and present an efficient search algorithm to find the optimal solution. We propose the application of the above search algorithm to the transformed weights space to do further improvement in practice. Our experiments demonstrate that our method outperforms state-of-the-art performance generally on top-1 accuracy for image classification tasks on the ResNet series models and Inception-v3 model. The experimental results show that the proposed method has almost no loss of top-1 accuracy in 8-bit and 6-bit settings for image classifications, and the accuracy of 4-bit quantization is also significantly improved. The code is available at https://github.com/codeiscommitting/REQuant.

Quantization Range Estimation for Convolutional Neural Networks

TL;DR

This work tackles the challenge of maintaining high accuracy in post-training quantization (PTQ) at low bit-width by introducing REQuant, a per-layer range-estimation framework that optimizes a scale-controlled parameter to minimize quantization error. The authors prove local convexity of the objective and solve it efficiently with a golden-section search, and further boost performance by a weight reshaping transform that allocates quantization intervals nonuniformly across the weight distribution. Experiments on CIFAR-10/100 with ResNet families and Inception-v3 show state-of-the-art top-1 accuracy at 8-bit and 6-bit, with meaningful gains at 4-bit, outperforming several PTQ baselines. The approach is practical, requires no retraining, and is accompanied by a public code release, enabling broader adoption in resource-constrained deployments.

Abstract

Post-training quantization for reducing the storage of deep neural network models has been demonstrated to be an effective way in various tasks. However, low-bit quantization while maintaining model accuracy is a challenging problem. In this paper, we present a range estimation method to improve the quantization performance for post-training quantization. We model the range estimation into an optimization problem of minimizing quantization errors by layer-wise local minima. We prove this problem is locally convex and present an efficient search algorithm to find the optimal solution. We propose the application of the above search algorithm to the transformed weights space to do further improvement in practice. Our experiments demonstrate that our method outperforms state-of-the-art performance generally on top-1 accuracy for image classification tasks on the ResNet series models and Inception-v3 model. The experimental results show that the proposed method has almost no loss of top-1 accuracy in 8-bit and 6-bit settings for image classifications, and the accuracy of 4-bit quantization is also significantly improved. The code is available at https://github.com/codeiscommitting/REQuant.

Paper Structure

This paper contains 10 sections, 2 theorems, 18 equations, 5 tables, 1 algorithm.

Key Result

Lemma 1

The minimization problem defined in eq:fmse is locally convex around any solution $\alpha^{*}$.

Theorems & Definitions (4)

  • Lemma 1
  • proof
  • Lemma 2
  • proof