On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks

Wei Huang; Haotong Qin; Yangdong Liu; Jingzhuo Liang; Yulun Zhang; Ying Li; Xianglong Liu

TL;DR

This work addresses edge deployment of quantized neural networks by introducing OHQ, a fully on-chip, hardware-aware mixed-precision quantization framework. It combines On-Chip Quantization Awareness (OQA), which measures actual hardware efficiency metrics at IP-core granularity using synthetic data, with Mask-Guided Quantization Estimation (MQE), which estimates each layer's accuracy impact via masking and KL divergence. An integer linear programming (ILP) step then selects per-layer bit-widths from both signals. Deployed on real FPGA hardware, the approach achieves competitive accuracy and lower latency on ResNet and MobileNet variants, demonstrating that fully on-chip mixed-precision quantization is practical. The authors acknowledge a remaining gap to full-precision accuracy under strong compression and note the need for broader hardware validation and refinement.
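As a rough illustration of the masking-plus-KL idea behind MQE, the sketch below scores each layer of a toy model by the mean KL divergence between the unmasked output and the output with that layer masked. It is not the authors' implementation: the toy affine layers, the helper names (`make_layer`, `forward`, `layer_sensitivity`), the choice to treat a masked layer as an identity map, and the Gaussian stand-in for synthesized data are all assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def make_layer(scale, shift):
    """Hypothetical toy layer: a shape-preserving elementwise affine map."""
    return lambda x: [scale * v + shift * i for i, v in enumerate(x)]


def forward(layers, x, masked=None):
    """Run the toy model; the layer at index `masked` is skipped (assumption)."""
    for idx, layer in enumerate(layers):
        if idx != masked:
            x = layer(x)
    return softmax(x)


def layer_sensitivity(layers, samples):
    """Mean KL divergence between unmasked and masked outputs, per layer."""
    scores = []
    for idx in range(len(layers)):
        total = 0.0
        for x in samples:
            total += kl_divergence(forward(layers, x), forward(layers, x, masked=idx))
        scores.append(total / len(samples))
    return scores


# Random inputs stand in for the synthesized calibration data (assumption).
rng = random.Random(0)
samples = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
layers = [make_layer(2.0, 0.1), make_layer(1.0, 0.0), make_layer(0.5, -0.3)]
scores = layer_sensitivity(layers, samples)
```

In this toy setup the middle layer is an identity map, so masking it leaves the output unchanged and its score is zero, while the other two layers receive positive scores. A layer with a higher score would be a candidate for a higher bit-width.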

Abstract

Low-bit quantization has emerged as one of the most promising compression approaches for deploying deep neural networks on edge devices. Mixed-precision quantization uses a mixture of bit-widths to unlock the accuracy and efficiency potential of quantized models. However, existing mixed-precision quantization methods rely on simulations on high-performance devices to trade off accuracy and efficiency in immense search spaces. This leaves a non-negligible gap between the estimated efficiency metrics and those of the actual hardware, which keeps quantized models far from the optimal accuracy and efficiency and also makes the quantization process depend on additional high-performance devices. In this paper, we propose an On-Chip Hardware-Aware Quantization (OHQ) framework that performs hardware-aware mixed-precision quantization on deployed edge devices to achieve accurate and efficient computing. Specifically, for efficiency metrics, we build an On-Chip Quantization Awareness (OQA) pipeline, which lets the quantization process perceive the actual hardware efficiency of each quantization operator and avoids optimization errors caused by inaccurate simulation. For accuracy metrics, we propose Mask-Guided Quantization Estimation (MQE) to effectively estimate the accuracy impact of operators in the on-chip scenario, removing the quantization process's dependence on high computing power. By combining insights from quantized models and hardware through linear optimization, we obtain optimized bit-width configurations that achieve outstanding accuracy and efficiency. We evaluate inference accuracy and acceleration with quantization for various architectures and compression ratios on hardware. OHQ achieves 70% and 73% accuracy for ResNet-18 and MobileNetV3, respectively, and reduces latency by 15-30% compared to INT8 in real deployment.
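The bit-width selection step can be pictured as a small constrained optimization: pick one bit-width per layer so that the total sensitivity cost is minimal while a measured hardware metric stays within budget. The sketch below is a stand-in under stated assumptions, not the paper's formulation. It uses exhaustive search rather than an ILP solver, considers a single latency constraint, and the function name `select_bit_widths` and all numbers are hypothetical.

```python
from itertools import product


def select_bit_widths(sensitivity, latency, bit_choices, latency_budget):
    """Pick one bit-width per layer by exhaustive search.

    sensitivity[l][b] and latency[l][b] hold the estimated accuracy cost and
    measured latency of layer l at bit-width b. Returns (config, cost), or
    (None, inf) when no configuration fits the budget.
    """
    best_cfg, best_cost = None, float("inf")
    for cfg in product(bit_choices, repeat=len(sensitivity)):
        total_latency = sum(latency[l][b] for l, b in enumerate(cfg))
        if total_latency > latency_budget:
            continue  # violates the hardware constraint
        cost = sum(sensitivity[l][b] for l, b in enumerate(cfg))
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost


# Hypothetical per-layer numbers for a three-layer model.
sensitivity = [{4: 0.9, 8: 0.1}, {4: 0.2, 8: 0.05}, {4: 0.5, 8: 0.1}]
latency = [{4: 1.0, 8: 2.0}, {4: 1.5, 8: 3.0}, {4: 0.5, 8: 1.0}]

config, cost = select_bit_widths(sensitivity, latency, (4, 8), latency_budget=4.5)
```

With these numbers the search keeps the first and third layers at 8 bits and drops the second, least sensitive layer to 4 bits, which exactly meets the latency budget. Exhaustive search grows exponentially with depth, which is why a linear-programming formulation is the practical choice for real networks.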

Paper Structure (26 sections, 12 equations, 7 figures, 3 tables, 2 algorithms)

Introduction
Related Work
Methodology
On-Chip Quantization Awareness
Data Preparation for Quantization
On-Chip Awareness of Hardware Metrics
Mask-Guided Quantization Estimation
On-Chip Awareness of Layerwise Sensitivity
Mixed-Precision Quantization Strategy
Experiment
Implementation Details
Comparison Results
Ablation Results
Hardware-aware Parameters
Optimization Factor Deconstruction
...and 11 more sections

Figures (7)

Figure 1: Off-chip vs. on-chip quantization. The left part shows the traditional off-chip quantization framework, which involves separate quantization analysis and deployment steps. The right part shows our OHQ framework, which is fully integrated on-chip.
Figure 2: Overview of the OHQ framework. The proposed OHQ obtains chip-level sensing parameters and layer-wise differences through a physical deployment (OQA and MQE are described in detail in Fig. 3 and Fig. 4, respectively).
Figure 3: The workflow of OQA. (Top) While computing, the PL part samples time, power, and other information for the four main steps, which use BRAM to optimize matrix multiplication and data transfer. (Bottom) The PS part coordinates the overall process, including accessing data, organizing the network, and instructing the IP cores.
Figure 4: Illustration of MQE for ResNet-18. We feed synthesized data into on-chip models. The figure shows the model with the 5th layer masked out.
Figure 5: Comparison of MobileNetV3's on-chip awareness characteristics and network parameters for each layer. The left image shows the relationship between model accuracy after layer-wise masking (in orange) and the sensitivity of the masked layer (in blue). The central image shows the relationship between the on-chip computational clock for each layer (in yellow) and the number of parameters in the layer (in green). The right image shows the relationship between the on-chip power consumption (in red) and the number of parameters.
...and 2 more figures
