
PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks

Marina Neseem, Conor McCullough, Randy Hsin, Chas Leichner, Shan Li, In Suk Chong, Andrew G. Howard, Lukasz Lew, Sherief Reda, Ville-Mikko Rautio, Daniele Moro

TL;DR

The paper identifies overlooked inefficiencies in low-precision neural networks caused by non-quantized elementwise operations, and argues that existing metrics such as ACE do not capture them. It introduces ACEv2, an extended cost metric that accounts for MACs, elementwise operations, and shifts, and reports a strong correlation between this metric and hardware energy. Building on this, the authors propose PikeLPN, a family of models that quantize both MACs and elementwise operations, featuring QuantNorm for batch normalization, Double Quantization for quantization scaling parameters, and Distribution-Heterogeneous Quantization for Separable Convolutions. Empirical results on ImageNet show up to 3× better inference efficiency at competitive accuracy, placing PikeLPN on the Pareto frontier of low-precision models and highlighting the importance of accounting for elementwise costs in quantization design.
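The cost argument above can be illustrated with a small sketch. It assumes an ACE-style proxy in which each group of arithmetic operations costs its count times the product of its operand bit-widths. Applying that same weighting to elementwise operations, treating floating-point operands as 32-bit, and the operation counts themselves are all illustrative assumptions; the actual ACEv2 definition is given in the paper and is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpCount:
    """A group of two-operand arithmetic ops sharing the same operand bit-widths."""
    count: int
    bits_a: int
    bits_b: int


def bitwidth_weighted_cost(ops):
    """Sum count * bits_a * bits_b over all op groups (an ACE-style proxy)."""
    return sum(op.count * op.bits_a * op.bits_b for op in ops)


# Hypothetical layer: many low-precision MACs, far fewer elementwise multiplies.
macs_4bit = [OpCount(count=1_000_000, bits_a=4, bits_b=4)]
elementwise_fp32 = [OpCount(count=50_000, bits_a=32, bits_b=32)]  # left unquantized
elementwise_8bit = [OpCount(count=50_000, bits_a=8, bits_b=8)]    # quantized

mac_cost = bitwidth_weighted_cost(macs_4bit)
ew_fp_cost = bitwidth_weighted_cost(elementwise_fp32)
ew_q_cost = bitwidth_weighted_cost(elementwise_8bit)

# Share of the total cost taken by elementwise ops, before and after quantizing them.
share_unquantized = ew_fp_cost / (mac_cost + ew_fp_cost)
share_quantized = ew_q_cost / (mac_cost + ew_q_cost)
print(f"elementwise share, unquantized: {share_unquantized:.2f}")
print(f"elementwise share, quantized:   {share_quantized:.2f}")
```

Under these made-up counts, the unquantized elementwise operations cost more than the 4-bit MACs even though there are 20 times fewer of them, which is the kind of imbalance a MAC-only metric would miss.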

Abstract

Low-precision quantization is recognized for its efficacy in neural network optimization. Our analysis reveals that non-quantized elementwise operations, which are prevalent in layers such as parameterized activation functions, batch normalization, and quantization scaling, dominate the inference cost of low-precision models. These non-quantized elementwise operations are commonly overlooked in SOTA efficiency metrics such as Arithmetic Computation Effort (ACE). In this paper, we propose ACEv2, an extended version of ACE that aligns better with the inference cost of quantized models and their energy consumption on ML hardware. Moreover, we introduce PikeLPN, a model that addresses these efficiency issues by applying quantization to both elementwise operations and multiply-accumulate operations. In particular, we present a novel quantization technique for batch normalization layers, named QuantNorm, which allows the batch normalization parameters to be quantized without compromising model performance. Additionally, we propose applying Double Quantization, in which the quantization scaling parameters are themselves quantized. Furthermore, we recognize and resolve the issue of distribution mismatch in Separable Convolution layers by introducing Distribution-Heterogeneous Quantization, which enables quantizing them to low precision. PikeLPN achieves Pareto-optimality in the efficiency-accuracy trade-off, with up to 3× efficiency improvement compared to SOTA low-precision models.
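The idea of quantizing the quantization scaling parameters can be sketched as follows. This is a minimal illustration of the general concept, assuming symmetric uniform per-channel weight quantization and a second uniform quantizer over the per-channel scales. The function names, the bit-widths, and the single shared floating-point scale are hypothetical choices, not the scheme specified in the paper.

```python
import numpy as np


def quantize_symmetric(x, scale, bits):
    """Round x / scale to signed integers in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)


def quantize_scales(scales, bits):
    """Second-level quantization: integer codes for the scales plus one shared scale."""
    levels = 2 ** bits - 1
    shared_scale = float(scales.max()) / levels
    # Codes start at 1 so that no channel ends up with a zero scale.
    codes = np.clip(np.round(scales / shared_scale), 1, levels).astype(np.int32)
    return codes, shared_scale


rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 64))  # 8 output channels, 64 weights each (made up)

weight_bits = 4
qmax = 2 ** (weight_bits - 1) - 1
fp_scales = np.abs(weights).max(axis=1) / qmax  # one floating-point scale per channel

# Quantize the scales themselves to 8-bit codes against one shared scale.
scale_codes, shared_scale = quantize_scales(fp_scales, bits=8)
effective_scales = scale_codes * shared_scale

# Quantize the weights with the reconstructed (quantized) scales.
q_weights = quantize_symmetric(weights, effective_scales[:, None], weight_bits)
dequantized = q_weights * effective_scales[:, None]
max_error = float(np.abs(weights - dequantized).max())
print("max reconstruction error:", max_error)
```

In this sketch the per-channel scaling step multiplies by a small integer code instead of a full-precision value, and only one floating-point scale remains for the whole tensor.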

Paper Structure (18 sections, 9 equations, 10 figures, 9 tables)


Figures (10)

  • Figure 1: Accuracy vs $ACE_{v2}$ of PikeLPN and SOTA low-precision neural networks. $ACE_{v2}$ is an efficiency metric that estimates the cost of arithmetic operations during inference.
  • Figure 2: Contribution of multiply-accumulate (MAC) versus elementwise operations to the commonly used efficiency metric $ACE_{v2}$ for PikeLPN-1X and PokeBNN-0.5X \cite{zhang2022pokebnn}. PikeLPN selectively increases the precision of MAC operations, which allows elementwise operations to be quantized effectively, achieving $3\times$ more efficiency while being 2% more accurate on ImageNet.
  • Figure 3: Arithmetic Energy on 45nm CMOS technology by multiply-accumulate operations versus non-quantized elementwise operations for MobileNetV2. Energy costs are calculated using Table \ref{tab:cost_metrics}. The figure reveals that elementwise operations are a substantial contributor to the overall cost in low-precision models.
  • Figure 4: PikeLPN building block architecture.
  • Figure 5: Weight distributions of pre-trained pointwise (PW) and depthwise (DW) convolution layers in PikeLPN: (a) sample pointwise layer weights; (b) sample depthwise layer weights.
  • ...and 5 more figures