HQP: Sensitivity-Aware Hybrid Quantization and Pruning for Ultra-Low-Latency Edge AI Inference
Dinesh Gopalan, Ratul Ali
TL;DR
The paper addresses ultra-low-latency edge AI inference under strict latency and energy constraints by coordinating pruning and quantization. It introduces HQP, a sensitivity-aware Hybrid Quantization and Pruning framework that uses a diagonal approximation of the Fisher Information Matrix to derive a global saliency metric $S$, prunes filters iteratively while the quality bound $A_{\text{baseline}} - A^{(t)} \le \Delta_{\text{ax}}$ holds (with $A^{(t)}$ the accuracy after pruning iteration $t$), and then applies robust post-training quantization (PTQ) to obtain a sparse INT8 model. The reported results reach up to a $3.12\times$ inference speedup and up to a $55\%$ model size reduction while keeping accuracy loss within $\Delta_{\text{ax}} = 1.5\%$, outperforming single-objective compression baselines on MobileNetV3 and ResNet-18 across NVIDIA Jetson platforms. The authors present HQP as hardware-agnostic and deployable end to end with TensorRT, and point to mixed-precision quantization and transformer architectures as possible extensions.
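The pruning stage summarized above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`diagonal_fisher`, `filter_saliency`, `prune_until_bound`), the per-filter score $\sum_i F_{ii} w_i^2$, the toy `evaluate` function, and its per-filter accuracy costs are all hypothetical, since the summary does not give the exact form of $S$.

```python
import numpy as np

def diagonal_fisher(per_sample_grads):
    """Diagonal empirical Fisher estimate: mean of squared per-sample gradients."""
    return np.mean(np.square(per_sample_grads), axis=0)

def filter_saliency(weights, fisher_diag):
    """Assumed per-filter score: sum of F_ii * w_i^2 over each filter's weights."""
    per_weight = fisher_diag * np.square(weights)
    return per_weight.reshape(weights.shape[0], -1).sum(axis=1)

def prune_until_bound(saliency, evaluate, baseline_acc, delta_max):
    """Drop filters in ascending saliency order while
    baseline_acc - evaluate(mask) <= delta_max; always keep one filter."""
    order = np.argsort(saliency)
    keep = np.ones(saliency.shape[0], dtype=bool)
    for idx in order[:-1]:
        candidate = keep.copy()
        candidate[idx] = False
        if baseline_acc - evaluate(candidate) > delta_max:
            break  # this removal would violate the quality bound
        keep = candidate
    return keep

# Toy layer with 3 filters of 2 weights each (hypothetical values).
weights = np.array([[1.0, 1.0], [0.1, 0.1], [2.0, 2.0]])
grads = np.ones((4, 3, 2))               # 4 per-sample gradients
saliency = filter_saliency(weights, diagonal_fisher(grads))

# Stand-in for a validation run: each removed filter costs some accuracy.
costs = np.array([0.05, 0.001, 0.2])
evaluate = lambda mask: 0.90 - costs[~mask].sum()

keep = prune_until_bound(saliency, evaluate, baseline_acc=0.90, delta_max=0.015)
```

In this toy run the lowest-saliency filter is removed and the loop stops before the next removal, because that step would exceed the 1.5% bound. In a real pipeline `evaluate` would be a validation pass over the pruned network, and sensitivities would typically be recomputed between iterations.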
Abstract
The escalating demand for high-fidelity, real-time inference in distributed edge-cloud environments necessitates aggressive model optimization to counteract severe latency and energy constraints. This paper introduces the Hybrid Quantization and Pruning (HQP) framework, a novel, integrated methodology designed to achieve synergistic model acceleration while adhering to strict quality guarantees. We detail a sensitivity-aware structural pruning algorithm that employs a dynamic weight sensitivity metric, derived from a highly efficient approximation of the Fisher Information Matrix (FIM), to guide the iterative removal of redundant filters. This pruning is strictly conditional: the model must stay within a maximum permissible accuracy drop ($\Delta_{\text{ax}}$) before it proceeds to 8-bit post-training quantization. This coordination is critical, as it ensures the resultant sparse model structure is maximally robust to quantization error and hardware-specific kernel optimization. Exhaustive evaluation across heterogeneous NVIDIA Jetson edge platforms, using resource-efficient architectures such as MobileNetV3 and ResNet-18, demonstrates that the HQP framework achieves a peak inference speedup of $3.12\times$ and a $55\%$ model size reduction, while keeping the accuracy drop below the $1.5\%$ constraint. A comprehensive comparative analysis against conventional single-objective compression techniques validates the HQP framework as a superior, hardware-agnostic solution for deploying ultra-low-latency AI in resource-limited edge infrastructures.
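The second stage, 8-bit post-training quantization of the pruned model, can be illustrated with a minimal NumPy sketch. It assumes symmetric per-filter quantization with max-abs calibration, which is one common PTQ scheme and not necessarily the one the paper uses. The names `quantize_int8` and `dequantize` and the example values are hypothetical.

```python
import numpy as np

def quantize_int8(weights, keep_mask):
    """Symmetric per-filter INT8 quantization of the filters kept after pruning."""
    kept = weights[keep_mask]
    flat = kept.reshape(kept.shape[0], -1)
    max_abs = np.abs(flat).max(axis=1)
    # One scale per filter; fall back to 1.0 for an all-zero filter.
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0)
    q = np.clip(np.round(flat / scale[:, None]), -127, 127).astype(np.int8)
    return q.reshape(kept.shape), scale

def dequantize(q, scale):
    """Map INT8 codes back to floats using the per-filter scales."""
    return q.astype(np.float32) * scale.reshape(-1, *([1] * (q.ndim - 1)))

# Toy layer: 3 filters, the middle one already removed by pruning.
weights = np.array([[0.5, -1.27], [0.0, 0.0], [2.54, 1.0]])
keep_mask = np.array([True, False, True])

q, scale = quantize_int8(weights, keep_mask)
max_err = np.max(np.abs(dequantize(q, scale) - weights[keep_mask]))
```

With round-to-nearest, the reconstruction error of each weight is at most half of its filter's scale. Removing filters before calibration means pruned weights do not influence the quantization ranges of the filters that remain.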
