SigmaQuant: Hardware-Aware Heterogeneous Quantization Method for Edge DNN Inference

Qunyou Liu, Pengbo Yu, Marina Zapater, David Atienza

TL;DR

This work introduces SigmaQuant, an adaptive layer-wise heterogeneous quantization framework designed to efficiently balance accuracy and resource usage for varied edge environments without exhaustive search.

Abstract

Deep neural networks (DNNs) are essential for performing advanced tasks on edge or mobile devices, yet their deployment is often hindered by severe resource constraints, including limited memory, energy, and computational power. While uniform quantization provides a straightforward approach to compressing models and reducing hardware requirements, it fails to fully leverage the varying robustness across layers and often leads to accuracy degradation or suboptimal resource usage, particularly at low bitwidths. In contrast, heterogeneous quantization, which allocates different bitwidths to individual layers, can mitigate these drawbacks. Nonetheless, current heterogeneous quantization methods either need a huge brute-force design-space search or lack the adaptability to meet different hardware conditions, such as memory size, energy budget, and latency requirements. To fill these gaps, this work introduces SigmaQuant, an adaptive layer-wise heterogeneous quantization framework designed to efficiently balance accuracy and resource usage for varied edge environments without exhaustive search.

Paper Structure (21 sections, 9 equations, 5 figures, 6 tables, 1 algorithm)

Figures (5)

  • Figure 1: (a) The operation hierarchy of AI models from top to bottom, and (b) the widely used shift-add-based multiplier for edge accelerators that target high energy efficiency.
  • Figure 2: Overview of our proposed distribution-fitting quantization method. We start from user-defined boundary conditions (target model size and accuracy) and adapt bitwidths in two phases: initial clustering by standard deviation followed by iterative KL-based refinement.
  • Figure 3: Example showing how training advances through the two-phase quantization for ResNet34. The x-axis represents corrected model sizes and the y-axis represents model accuracy. Different points indicate successive stages in the cluster phase (Phase 1) and the iteration phase (Phase 2), with the final quantized model landing in the target area.
  • Figure 4: (a) Comparison of Top-1 accuracy versus model size for various ResNet architectures on CIFAR-100, where darker markers denote the sigma-based method and lighter markers denote uniform quantization. (b) Regression fits with ±1$\sigma$ error bands reveal that the sigma approach consistently achieves higher accuracy at equivalent model sizes.
  • Figure 5: Normalized energy consumption (top) and cycle count (hence latency, bottom)
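
The two-phase flow described in the Figure 2 caption (initial clustering by standard deviation, then iterative KL-based refinement toward a target model size) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the bitwidth choices, the rule that layers with larger weight standard deviation receive more bits, the histogram-based KL estimate, and the greedy refinement loop are all assumptions made for illustration, and every function name below is hypothetical. The accuracy target mentioned in the caption is not modeled here.

```python
import numpy as np

def quantize_uniform(w, bits):
    """Symmetric uniform quantization of a weight array (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    if scale == 0:
        return w.copy()
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def kl_divergence(w, w_q, bins=64):
    """Histogram estimate of KL(P || Q) for original vs. quantized weights."""
    lo, hi = float(w.min()), float(w.max())
    p, _ = np.histogram(w, bins=bins, range=(lo, hi))
    q, _ = np.histogram(w_q, bins=bins, range=(lo, hi))
    eps = 1e-8  # smoothing so empty bins do not produce log(0)
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def model_size_bits(layers, bits):
    """Total weight storage in bits for a per-layer bitwidth assignment."""
    return sum(w.size * b for w, b in zip(layers, bits))

def initial_bitwidths(layers, choices=(2, 4, 8)):
    """Phase 1 (assumed rule): group layers by weight std.

    Layers are ranked by standard deviation and split into len(choices)
    groups; the group with the smallest std gets the smallest bitwidth.
    """
    stds = np.array([w.std() for w in layers])
    groups = np.array_split(np.argsort(stds), len(choices))
    bits = [0] * len(layers)
    for b, group in zip(choices, groups):
        for i in group:
            bits[int(i)] = b
    return bits

def refine(layers, bits, target_bits, choices=(2, 4, 8)):
    """Phase 2 (assumed rule): greedy KL-guided bitwidth reduction.

    While the model exceeds the size target, lower the bitwidth of the
    layer whose weight distribution is distorted least (smallest KL).
    """
    bits = list(bits)
    while model_size_bits(layers, bits) > target_bits:
        candidates = []
        for i, (w, b) in enumerate(zip(layers, bits)):
            idx = choices.index(b)
            if idx > 0:
                lower = choices[idx - 1]
                kl = kl_divergence(w, quantize_uniform(w, lower))
                candidates.append((kl, i, lower))
        if not candidates:
            break  # every layer is already at the smallest bitwidth
        _, i, lower = min(candidates)
        bits[i] = lower
    return bits

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three synthetic "layers" with increasing weight spread.
    layers = [rng.normal(0.0, s, size=1000) for s in (0.01, 0.1, 1.0)]
    start = initial_bitwidths(layers)
    final = refine(layers, start, target_bits=10_000)
    print(start, final, model_size_bits(layers, final))
```

In this sketch the size target acts as the only boundary condition; a fuller version would also retrain and check accuracy between refinement steps, which the caption implies but this example leaves out.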