Towards the Limit of Network Quantization

Yoojin Choi, Mostafa El-Khamy, Jungwon Lee

TL;DR

This paper addresses the problem of compressing deep neural networks by quantizing their weights under a compression ratio constraint. It introduces a Hessian-weighted distortion measure, so that the parameters whose quantization errors affect the loss most are represented most precisely, and it relates the problem to entropy-constrained scalar quantization (ECSQ), for which it proposes two solutions: uniform quantization and a Lloyd-like iterative algorithm. Because the weighting accounts for each parameter's effect on the loss, all layers can be quantized jointly. The approach is validated on LeNet, ResNet-32, and AlexNet with pruning, and uniform quantization followed by Huffman coding reaches compression ratios of 51.25, 22.17 and 40.65, respectively. These results make the methods a practical route to deploying networks on resource-constrained hardware.

Abstract

Network quantization is one of the network compression techniques used to reduce the redundancy of deep neural networks. It reduces the number of distinct network parameter values by quantization in order to save storage. In this paper, we design network quantization schemes that minimize the performance loss due to quantization given a compression ratio constraint. We analyze the quantitative relation of quantization errors to the neural network loss function and identify that the Hessian-weighted distortion measure is locally the right objective function for the optimization of network quantization. As a result, Hessian-weighted k-means clustering is proposed for clustering the network parameters to be quantized. When optimal variable-length binary codes, e.g., Huffman codes, are employed for further compression, we show that the network quantization problem can be related to the entropy-constrained scalar quantization (ECSQ) problem in information theory, and we consequently propose two solutions of ECSQ for network quantization, i.e., uniform quantization and an iterative solution similar to Lloyd's algorithm. Finally, using simple uniform quantization followed by Huffman coding, we show experimentally that compression ratios of 51.25, 22.17 and 40.65 are achievable for LeNet, 32-layer ResNet and AlexNet, respectively.
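To make the Hessian-weighted k-means idea in the abstract concrete, here is a minimal sketch in plain Python. It assumes each scalar weight w_i comes with a non-negative importance h_i (for example, a diagonal Hessian entry; this excerpt does not spell out how the weights are obtained) and minimizes the sum of h_i (w_i - c_j)^2 over cluster centers c_j. The function names, the evenly spaced initialization, and the toy data are illustrative assumptions, not the authors' implementation.

```python
def hessian_weighted_kmeans(weights, hessian_diag, k, iters=50):
    """Cluster scalar weights into k groups, minimizing sum_i h_i * (w_i - c)^2.

    Hypothetical helper for illustration only.
    """
    lo, hi = min(weights), max(weights)
    # Assumed initialization: centers evenly spaced over the weight range.
    centers = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]

    def nearest(w):
        # h_i scales all candidate distances for weight i equally,
        # so the assignment step is plain nearest-center.
        return min(range(k), key=lambda j: (w - centers[j]) ** 2)

    for _ in range(iters):
        assign = [nearest(w) for w in weights]
        new_centers = list(centers)
        for j in range(k):
            num = sum(h * w for w, h, a in zip(weights, hessian_diag, assign) if a == j)
            den = sum(h for h, a in zip(hessian_diag, assign) if a == j)
            if den > 0:
                # Update step: Hessian-weighted mean of the cluster members.
                new_centers[j] = num / den
        if new_centers == centers:
            break
        centers = new_centers

    assign = [nearest(w) for w in weights]
    return centers, assign


def weighted_distortion(weights, hessian_diag, centers, assign):
    """Hessian-weighted quantization distortion for a given clustering."""
    return sum(h * (w - centers[a]) ** 2
               for w, h, a in zip(weights, hessian_diag, assign))


# Toy example: the second weight has a larger h, so it pulls its center toward itself.
toy_w = [0.0, 0.1, 1.0, 1.1]
toy_h = [1.0, 3.0, 1.0, 1.0]
toy_centers, toy_assign = hessian_weighted_kmeans(toy_w, toy_h, k=2)
```

In the toy example the first cluster's center settles closer to 0.1 than to 0.0, which is the intended effect: parameters with larger h_i are quantized with smaller error.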

Paper Structure

This paper contains 24 sections, 17 equations, 3 figures, 1 table, 1 algorithm.

Figures (3)

  • Figure 1: Accuracy versus average codeword length per network parameter after network quantization for 32-layer ResNet.
  • Figure 2: Accuracy versus average codeword length per network parameter after network quantization, Huffman coding and fine-tuning for LeNet and 32-layer ResNet, when the Hessian is computed with 50,000 or 1,000 samples and when the square roots of the second moment estimates of gradients are used as an alternative to the Hessian.
  • Figure 3: Accuracy versus average codeword length per network parameter after network quantization, Huffman coding and fine-tuning for 32-layer ResNet when uniform quantization with non-weighted mean and uniform quantization with Hessian-weighted mean are used.
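
The quantity on the horizontal axis of these figures, average codeword length per network parameter, can be illustrated with a small sketch: quantize the weights uniformly, build a Huffman code over the resulting indices, and average the codeword lengths. The step size, function names, and toy weights below are assumptions chosen for illustration; the sketch ignores codebook storage and is not the authors' pipeline.

```python
import heapq
from collections import Counter


def uniform_quantize(weights, step):
    """Map each weight to the integer index of its nearest multiple of `step`.

    Note: Python's round() resolves exact .5 ties to the even integer.
    """
    return [round(w / step) for w in weights]


def huffman_code_lengths(symbols):
    """Return {symbol: codeword length in bits} for a Huffman code on symbol counts."""
    counts = Counter(symbols)
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): 1}
    # Heap entries: (count, unique tie-breaker, {symbol: depth in subtree}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, next_id, merged))
        next_id += 1
    return heap[0][2]


def average_codeword_length(symbols):
    """Average number of Huffman-coded bits per symbol."""
    lengths = huffman_code_lengths(symbols)
    return sum(lengths[s] for s in symbols) / len(symbols)


toy_weights = [0.01, -0.02, 0.0, 0.03, 0.26, -0.24, 0.02, 0.51]
indices = uniform_quantize(toy_weights, step=0.25)
avg_bits = average_codeword_length(indices)
```

Because most toy weights fall into the bin around zero, that index receives a one-bit codeword and the average drops below the two bits a fixed-length code would need for four distinct indices.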