Efficient and Effective Methods for Mixed Precision Neural Network Quantization for Faster, Energy-efficient Inference

Deepika Bablani, Jeffrey L. Mckinstry, Steven K. Esser, Rathinakumar Appuswamy, Dharmendra S. Modha

TL;DR

This work tackles the challenge of efficiently quantizing neural networks with mixed precision by introducing two metrics, ALPS and EAGL, that estimate how much each layer benefits from higher precision under a computation budget. By formulating layer-precision selection as a 0-1 knapsack problem and validating that layer contributions are additive, the authors demonstrate that networks with a mix of 4-bit and 2-bit layers can recover full FP32 accuracy across ResNet architectures and a BERT-base model, improving the accuracy-throughput frontier. EAGL emphasizes rapid, data-agnostic layer evaluation via the entropy of empirical weight distributions, while ALPS offers a straightforward, fine-tuning-based gain estimate; together they outperform state-of-the-art methods like HAWQ-v3 on vision and NLP tasks and scale to other domains. The results have practical implications for deploying energy-efficient, high-throughput inference on hardware that supports low-precision ops, enabling broader adoption of mixed-precision techniques across diverse applications.
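
The knapsack formulation mentioned above can be illustrated with a minimal sketch. It assumes each layer has one estimated gain for staying at the higher precision and one integer extra compute cost. The function name, the integer cost units, and the example numbers are hypothetical and not taken from the paper; the dynamic program is the textbook 0-1 knapsack solver, not necessarily the optimizer the authors used.

```python
from typing import List, Sequence, Tuple


def select_high_precision_layers(
    gains: Sequence[float], costs: Sequence[int], budget: int
) -> Tuple[float, List[int]]:
    """Textbook 0-1 knapsack via dynamic programming.

    gains[i]: estimated benefit of keeping layer i at the higher precision
              (hypothetical values; ALPS or EAGL would supply these).
    costs[i]: extra compute, in integer units, for keeping layer i at the
              higher precision (the unit is an assumption of this sketch).
    budget:   total extra compute allowed, in the same units.
    Returns the best total gain and the indices of the layers kept high.
    """
    n = len(gains)
    # best[i][b] = max total gain using only the first i layers with budget b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        gain, cost = gains[i - 1], costs[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: layer i-1 goes low precision
            if cost <= b and best[i - 1][b - cost] + gain > best[i][b]:
                best[i][b] = best[i - 1][b - cost] + gain  # option 2: keep it high

    # Backtrack to recover which layers were kept at the higher precision.
    chosen: List[int] = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= costs[i - 1]
    chosen.reverse()
    return best[n][budget], chosen


if __name__ == "__main__":
    total, layers = select_high_precision_layers(
        gains=[3.0, 4.0, 5.0, 6.0], costs=[2, 3, 4, 5], budget=5
    )
    print(total, layers)
```

Because the objective is a plain sum of per-layer gains, this approach is only meaningful if layer contributions are approximately additive, which is the property the authors report validating.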

Abstract

For efficient neural network inference, it is desirable to achieve state-of-the-art accuracy with the simplest networks requiring the least computation, memory, and power. Quantizing networks to lower precision is a powerful technique for simplifying networks. As each layer of a network may have different sensitivity to quantization, mixed precision quantization methods selectively tune the precision of individual layers to achieve a minimum drop in task performance (e.g., accuracy). To estimate the impact of layer precision choice on task performance, two methods are introduced: i) Entropy Approximation Guided Layer selection (EAGL) is fast and uses the entropy of the weight distribution, and ii) Accuracy-aware Layer Precision Selection (ALPS) is straightforward and relies on single epoch fine-tuning after layer precision reduction. Using EAGL and ALPS for layer precision selection, full-precision accuracy is recovered with a mix of 4-bit and 2-bit layers for ResNet-50, ResNet-101 and BERT-base transformer networks, demonstrating enhanced performance across the entire accuracy-throughput frontier. The techniques outperform existing techniques in several commensurate comparisons. Notably, this is accomplished with significantly less computation time required to reach a solution.
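
The entropy measure that EAGL relies on can be sketched as follows, assuming a layer's weights are already quantized to integer levels. The function name and the toy layers are hypothetical, and how the paper normalizes or post-processes the entropy is not shown here; this only computes the Shannon entropy, in bits, of the empirical distribution over quantization bins.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def weight_entropy_bits(quantized_weights: Iterable[int]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of
    quantized weight values in one layer."""
    counts = Counter(quantized_weights)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("layer has no weights")
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def rank_layers_by_entropy(layers: Dict[str, List[int]]) -> List[str]:
    """Order layer names from lowest to highest entropy. Under the EAGL
    heuristic, lower-entropy layers are the better candidates for
    further quantization."""
    return sorted(layers, key=lambda name: weight_entropy_bits(layers[name]))


if __name__ == "__main__":
    # Toy 4-bit layers (values in 0..15); the data is made up for illustration.
    toy_layers = {
        "peaked": [7] * 90 + [8] * 10,   # mass concentrated in two bins
        "uniform": list(range(16)) * 4,  # all 16 bins equally used
    }
    for name in rank_layers_by_entropy(toy_layers):
        print(name, round(weight_entropy_bits(toy_layers[name]), 4))
```

No training data or fine-tuning is needed to compute this quantity, which is consistent with the abstract's description of EAGL as the fast method.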

Paper Structure

This paper contains 22 sections, 4 equations, 9 figures, 3 tables, 2 algorithms.

Figures (9)

  • Figure 1: Evaluation framework for comparing layer precision selection approaches. For a given network, computation budget, and fine-tuning procedure, identify the mixed precision method whose per-layer precision choice achieves the highest performance (e.g., accuracy) on some task. A method under evaluation provides a layer-wise accuracy gain estimate, which an optimization process uses along with the corresponding computation costs to choose a precision for each layer. The resulting network is then fine-tuned to measure performance on the task, which is used to rank the mixed precision methods considered.
  • Figure 2: Histogram of normalized counts of quantized weights in each bin for 3 layers of a trained 4-bit ResNet-101 network. EAGL predicts that layers with lower entropy are better candidates for further quantization. For example, between the three layers shown above, EAGL predicts that quantizing the first layer (entropy $= 1.3977$ bits) to 2 bits has lower impact on task accuracy than quantizing the third layer (entropy $= 3.7368$ bits).
  • Figure 3: ALPS and EAGL perform better than leading mixed precision techniques using ResNet-50 for all computational budgets and meet or exceed the accuracy of ResNet-101 for 7 out of 8 computational budgets on ImageNet (Deng et al., 2009). Mean +/- standard deviation across 5 seeds for each technique at each budget. A network with all configurable layers at 4 bits has a computational budget of 100% and a network with all configurable layers at 2 bits has a computational budget of 50% in this plot. The first and last layers are 8-bit, and intermediate layers with fewer than 128 input features are fixed at 4-bit.
  • Figure 4: ALPS and EAGL meet or exceed the mean IoU of leading techniques on PSPNet across computational budgets. Mean +/- standard deviation across 5 seeds at each budget.
  • Figure 5: ALPS and EAGL find more accurate mixed precision networks for SQuAD v1.1 across all computational budgets. Mean +/- standard deviation across 3 seeds at each budget.
  • ...and 4 more figures