HAWQ-V2: Hessian Aware trace-Weighted Quantization of Neural Networks

Zhen Dong, Zhewei Yao, Yaohui Cai, Daiyaan Arfeen, Amir Gholami, Michael W. Mahoney, Kurt Keutzer

TL;DR

HAWQ-V2 introduces a Hessian-trace-based framework to automate mixed-precision quantization, replacing the prior top-eigenvalue sensitivity with the average Hessian trace and using Hutchinson trace estimation for scalability. A Pareto-frontier approach then selects per-layer bit-precisions under a target model size, and the method is extended to mixed-precision activation quantization, with per-input activation trace estimation that exploits a block-diagonal Hessian structure. The approach delivers state-of-the-art compression-accuracy tradeoffs on ImageNet models (e.g., Inception-V3, ResNet-50, SqueezeNext) and improves object-detection performance on COCO RetinaNet, including benefits from activation-quantization strategies. These results demonstrate practical viability of leveraging second-order information to guide quantization in both classification and detection tasks, with potential for further gains by training to flatten loss landscapes and handling data-constrained scenarios.

Abstract

Quantization is an effective method for reducing the memory footprint and inference time of Neural Networks, e.g., for efficient inference in the cloud and especially at the edge. However, ultra-low-precision quantization can lead to significant degradation in model generalization. A promising method to address this is to perform mixed-precision quantization, where more sensitive layers are kept at higher precision. However, the search space for mixed-precision quantization is exponential in the number of layers. Recent work has proposed HAWQ, a novel Hessian-based framework, with the aim of reducing this exponential search space by using second-order information. While promising, this prior work has three major limitations: (i) HAWQV1 only uses the top Hessian eigenvalue as a measure of sensitivity and does not consider the rest of the Hessian spectrum; (ii) the HAWQV1 approach only provides the relative sensitivity of different layers and therefore requires a manual selection of the mixed-precision setting; and (iii) HAWQV1 does not consider mixed-precision activation quantization. Here, we present HAWQV2, which addresses these shortcomings. For (i), we perform a theoretical analysis showing that a better sensitivity metric is the average of all of the Hessian eigenvalues. For (ii), we develop a Pareto-frontier-based method for selecting the exact bit precision of different layers without any manual selection. For (iii), we extend the Hessian analysis to mixed-precision activation quantization. We have found this to be very beneficial for object detection. We show that HAWQV2 achieves new state-of-the-art results for a wide range of tasks.
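
The average of all Hessian eigenvalues equals the Hessian trace divided by its dimension, and the TL;DR notes that this trace is estimated with Hutchinson's method. A minimal sketch of that estimator, in pure Python, might look as follows. The small explicit symmetric matrix `H` and the helper names `matvec` and `hutchinson_trace` are assumptions made for illustration; in a real network the Hessian would not be formed explicitly, and the `hvp` argument would instead be a Hessian-vector product obtained by automatic differentiation.

```python
import random

def matvec(H, v):
    """Multiply a dense matrix (given as a list of rows) by a vector."""
    return [sum(h_ij * v_j for h_ij, v_j in zip(row, v)) for row in H]

def hutchinson_trace(hvp, n, num_samples=2000, seed=0):
    """Estimate Tr(H) as the mean of z^T (H z) over random Rademacher vectors z."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]  # entries are +1 or -1
        Hz = hvp(z)                                      # only H @ z is needed
        total += sum(z_i * hz_i for z_i, hz_i in zip(z, Hz))
    return total / num_samples

# Toy symmetric "Hessian" (an assumption, for illustration only).
H = [[4.0, 1.0, 0.5],
     [1.0, 3.0, 0.2],
     [0.5, 0.2, 2.0]]
n = len(H)

exact_trace = sum(H[i][i] for i in range(n))
est_trace = hutchinson_trace(lambda v: matvec(H, v), n)

# Sensitivity metric discussed in the abstract: average eigenvalue = trace / n.
avg_eigenvalue_estimate = est_trace / n
```

The estimator only ever calls `hvp`, which is why the approach scales to layers whose Hessian is far too large to store.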

Paper Structure

This paper contains 14 sections, 1 theorem, 16 equations, 7 figures, 4 tables.

Key Result

Lemma 1

Suppose we quantize two layers (denoted by $B_1$ and $B_2$) with the same amount of perturbation, namely $\left\lVert\Delta W_1^*\right\rVert_2^2 = \left\lVert\Delta W_2^*\right\rVert_2^2$. Then, under Assumption 1, we will have: if

Figures (7)

  • Figure 1.1: Mixed Precision Illustration of ResNet20. Here we show the network architecture and list four possible bit precision settings for each layer. Since the number of possible bit settings is an exponential function of the number of blocks in a given network, we propose HAWQ-V2 to generate precision settings automatically based on Hessian information instead of using simple search methods [wu2018mixed, wang2018haq].
  • Figure 2.1: Average Hessian trace of different blocks in Inception-V3 and ResNet50 on ImageNet, along with the loss landscape of blocks 4 and 16 in Inception-V3 (blocks 1 and 52 in ResNet50). As one can see, the average Hessian trace differs significantly across blocks. We use this information to determine the quantization precision setting, i.e., we assign more bits to blocks with larger average Hessian trace, and fewer bits to blocks with smaller average Hessian trace.
  • Figure 2.2: Illustration of the structure of the Hessian w.r.t. activations ($H_{a_j}$). Different-sized inputs $x_i$ produce different-sized blocks $H_{a_j(x_i)}$, which appear on the diagonal of $H_{a_j}$.
  • Figure 2.3: Pareto Frontier: The trade-off between model size and the sum of the $\Omega$ metric (Eqn. \ref{eqn:define_O}) in Inception-V3. Here, $L$ is the number of blocks in the model, and each point in the figure stands for a specific bit precision setting. We show the bit precision setting used in Direct quantization as well as in HAWQ. For a fair comparison, we constrain HAWQ-V2 to have the same model size as HAWQ.
  • Figure 3.1: Relationship between the convergence of Hutchinson and the number of data points (Left) as well as the number of steps (Right) used for trace estimation on block 21 in ResNet50.
  • ...and 2 more figures
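
The caption of Figure 2.3 describes choosing a bit precision setting by trading model size against the summed $\Omega$ metric. The sketch below illustrates that kind of selection by brute force on a toy three-block model. Everything in it is an assumption for illustration: the block sizes, the trace values, the min-max uniform quantizer, the helper names, and in particular the per-block form of $\Omega$ used here (average Hessian trace times the squared L2 norm of the quantization perturbation), which this excerpt does not spell out.

```python
import random
from itertools import product

BIT_CHOICES = (2, 4, 8)  # candidate precisions per block (assumed)

def make_block(avg_trace, num_params, seed):
    """Toy block: an assumed average Hessian trace plus random weights."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(num_params)]
    return {"avg_trace": avg_trace, "weights": weights}

def perturbation_sq(weights, bits):
    """Squared L2 error of min-max uniform quantization to `bits` bits."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (2 ** bits - 1)
    return sum((lo + round((w - lo) / scale) * scale - w) ** 2 for w in weights)

def omega(block, bits):
    """Assumed per-block sensitivity: average trace times squared perturbation."""
    return block["avg_trace"] * perturbation_sq(block["weights"], bits)

blocks = [make_block(50.0, 100, 0), make_block(5.0, 400, 1), make_block(0.5, 200, 2)]

def size_bits(setting):
    """Model size in bits for one bit precision setting."""
    return sum(len(b["weights"]) * k for b, k in zip(blocks, setting))

def total_omega(setting):
    return sum(omega(b, k) for b, k in zip(blocks, setting))

# Every setting is one point in the size-vs-Omega plane; the count is exponential in L.
candidates = list(product(BIT_CHOICES, repeat=len(blocks)))

def best_setting(max_size_bits):
    """Smallest total Omega among settings that fit the size budget."""
    feasible = [s for s in candidates if size_bits(s) <= max_size_bits]
    return min(feasible, key=total_omega)

budget = 2800  # same size as uniform 4-bit quantization of all 700 weights
chosen = best_setting(budget)
```

Exhaustive enumeration is only feasible here because the toy model has three blocks; with $L$ blocks and three choices each, the number of candidates grows as $3^L$, which is the exponential search space the abstract refers to.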

Theorems & Definitions (1)

  • Lemma 1