Efficient and Effective Methods for Mixed Precision Neural Network Quantization for Faster, Energy-efficient Inference
Deepika Bablani, Jeffrey L. McKinstry, Steven K. Esser, Rathinakumar Appuswamy, Dharmendra S. Modha
TL;DR
This work addresses efficient mixed precision quantization of neural networks by introducing two metrics, ALPS and EAGL, that estimate how much each layer benefits from higher precision under a computation budget. The authors formulate layer-precision selection as a $0$-$1$ Knapsack problem, validate that per-layer contributions are additive, and show that networks with a mix of $4$-bit and $2$-bit layers can recover full-precision (FP32) accuracy on ResNet architectures and a BERT-base model, improving the accuracy-throughput frontier. EAGL evaluates layers rapidly and without data, using the entropy of the empirical weight distribution, while ALPS estimates the accuracy gain directly through brief fine-tuning. Both outperform state-of-the-art methods such as HAWQ-v3 on vision and NLP tasks, and the authors report that they scale to other domains. The results matter for deploying energy-efficient, high-throughput inference on hardware that supports low-precision operations.
Abstract
For efficient neural network inference, it is desirable to achieve state-of-the-art accuracy with the simplest networks requiring the least computation, memory, and power. Quantizing networks to lower precision is a powerful technique for simplifying networks. Because each layer of a network may have a different sensitivity to quantization, mixed precision quantization methods selectively tune the precision of individual layers to minimize the drop in task performance (e.g., accuracy). To estimate the impact of layer precision choice on task performance, two methods are introduced: i) Entropy Approximation Guided Layer selection (EAGL), which is fast and uses the entropy of the weight distribution, and ii) Accuracy-aware Layer Precision Selection (ALPS), which is straightforward and relies on single-epoch fine-tuning after layer precision reduction. Using EAGL and ALPS for layer precision selection, full-precision accuracy is recovered with a mix of 4-bit and 2-bit layers for ResNet-50, ResNet-101, and BERT-base transformer networks, demonstrating improved performance across the entire accuracy-throughput frontier. Both methods outperform existing techniques in several commensurate comparisons. Notably, this is accomplished with significantly less computation time required to reach a solution.
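The selection pipeline summarized above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each layer's weights are already quantized to integer codes in the range 0 to 2^b - 1, that the entropy of those codes is used directly as the layer's gain from higher precision, and that each layer has an integer cost for staying at higher precision. The function names `weight_entropy` and `knapsack_select`, the cost units, and the example numbers are all hypothetical.

```python
import numpy as np


def weight_entropy(quantized_weights, num_bits):
    """Shannon entropy, in bits, of the empirical distribution of integer weight codes.

    Assumption: codes are non-negative integers below 2**num_bits.
    """
    levels = 2 ** num_bits
    counts = np.bincount(np.asarray(quantized_weights).ravel(), minlength=levels)
    probs = counts / counts.sum()
    nonzero = probs[probs > 0]  # skip empty bins so log2 is defined
    return float(-(nonzero * np.log2(nonzero)).sum())


def knapsack_select(gains, costs, budget):
    """Solve a 0-1 knapsack by dynamic programming over integer costs.

    Returns the sorted indices of the layers kept at higher precision.
    """
    n = len(gains)
    # best[i][b]: largest total gain using the first i layers within budget b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        gain, cost = gains[i - 1], costs[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]
            if cost <= b and best[i - 1][b - cost] + gain > best[i][b]:
                best[i][b] = best[i - 1][b - cost] + gain
    # Walk back through the table to recover which layers were chosen.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= costs[i - 1]
    return sorted(chosen)


# Hypothetical example: three layers with made-up gains and integer costs.
layer_gains = [1.5, 0.2, 1.9]
layer_costs = [2, 1, 2]
high_precision_layers = knapsack_select(layer_gains, layer_costs, budget=3)
```

In this toy example the budget admits layers 1 and 2 together, which gives a larger total gain than any other feasible subset. Treating the total gain as a sum over layers relies on the additivity of layer contributions that the TL;DR says the authors validate.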
