HQP: Sensitivity-Aware Hybrid Quantization and Pruning for Ultra-Low-Latency Edge AI Inference

Dinesh Gopalan; Ratul Ali

TL;DR

The paper tackles the challenge of enabling ultra-low-latency edge AI under strict latency and energy constraints by coordinating pruning and quantization. It introduces HQP, a sensitivity-aware Hybrid Quantization and Pruning framework that uses a diagonal Fisher Information Matrix approximation to derive a global saliency metric $S$, prunes iteratively under a quality bound $A_{\text{baseline}} - A^{(t)} \le \Delta_{\text{ax}}$, and then applies robust PTQ to obtain a sparse INT8 model. The approach achieves substantial performance gains (up to $3.12\times$ speedup) and model size reductions (up to $55\%$) while keeping accuracy loss within $\Delta_{\text{ax}} = 1.5\%$, outperforming single-objective compression baselines on MobileNetV3 and ResNet-18 across NVIDIA Jetson platforms. This work demonstrates HQP’s hardware-agnostic practicality, end-to-end deployability with TensorRT, and potential for extending to mixed-precision quantization and transformer architectures.
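The prune-under-a-bound loop described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the saliency formula below is the standard second-order Taylor score under a diagonal empirical Fisher approximation, assumed here because the paper's exact definition of $S$ is not shown in this summary. The names `fisher_filter_saliency`, `conditional_prune`, `max_drop` (standing in for $\Delta_{\text{ax}}$), and `toy_eval` are hypothetical.

```python
import numpy as np

def fisher_filter_saliency(grads, weights):
    """Per-filter saliency under a diagonal empirical Fisher approximation.

    grads:   (num_samples, num_filters, filter_size) per-sample loss gradients
    weights: (num_filters, filter_size) current filter weights
    Assumed score: S_f = 0.5 * sum_i F_ii * w_i^2 over the weights of filter f.
    """
    fisher_diag = np.mean(grads ** 2, axis=0)            # diagonal Fisher estimate
    return 0.5 * np.sum(fisher_diag * weights ** 2, axis=1)

def conditional_prune(saliency, evaluate, baseline_acc, max_drop, step=1):
    """Remove the least salient filters while the accuracy drop stays <= max_drop.

    evaluate(mask) is a caller-supplied function returning validation accuracy
    for a boolean keep-mask; it is a placeholder for a real evaluation pass.
    """
    order = np.argsort(saliency)                         # least salient first
    keep = np.ones(saliency.shape[0], dtype=bool)
    for start in range(0, len(order), step):
        candidate = keep.copy()
        candidate[order[start:start + step]] = False
        if not candidate.any():
            break                                        # never prune every filter
        if baseline_acc - evaluate(candidate) > max_drop:
            break                                        # bound violated: reject this step
        keep = candidate                                 # bound holds: accept this step
    return keep

# Toy example: accuracy falls in proportion to the saliency that was removed.
saliency = np.array([0.1, 0.5, 0.2, 0.9])
toy_eval = lambda mask: 0.80 - 0.04 * saliency[~mask].sum()
mask = conditional_prune(saliency, toy_eval, baseline_acc=0.80, max_drop=0.015)
print(mask)
```

In the toy run, the two least salient filters are removed (drops of 0.004 and 0.012), and the third removal is rejected because it would exceed the 0.015 bound. The surviving mask is what would then be handed to the quantization stage.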

Abstract

The escalating demand for high-fidelity, real-time inference in distributed edge-cloud environments necessitates aggressive model optimization to counteract severe latency and energy constraints. This paper introduces the Hybrid Quantization and Pruning (HQP) framework, a novel, integrated methodology designed to achieve synergistic model acceleration while adhering to strict quality guarantees. We detail a sensitivity-aware structural pruning algorithm that employs a dynamic weight sensitivity metric, derived from an efficient approximation of the Fisher Information Matrix (FIM), to guide the iterative removal of redundant filters. This pruning is strictly conditional, enforcing a maximum permissible accuracy drop ($\Delta_{\text{ax}}$) before the model proceeds to 8-bit post-training quantization. This coordination is critical, as it ensures the resulting sparse model structure is robust to quantization error and amenable to hardware-specific kernel optimization. Evaluation across heterogeneous NVIDIA Jetson edge platforms, using resource-efficient architectures such as MobileNetV3 and ResNet-18, demonstrates that the HQP framework achieves a peak inference speedup of $3.12\times$ and a $55\%$ model size reduction, while keeping the accuracy drop below the $1.5\%$ constraint. A comparative analysis against conventional single-objective compression techniques validates the HQP framework as a superior, hardware-agnostic solution for deploying ultra-low-latency AI in resource-limited edge infrastructures.
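As a rough illustration of the 8-bit post-training quantization step, the sketch below applies symmetric per-tensor INT8 quantization to a weight array. It is a generic textbook scheme chosen for brevity, not the calibration procedure used in the paper or by TensorRT; the function names `quantize_int8` and `dequantize` are hypothetical.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization of a float array to int8.

    Returns the int8 codes and the scale such that w is approximately q * scale.
    """
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0      # avoid division by zero
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to float32 for error inspection."""
    return q.astype(np.float32) * scale

# Toy weights: the largest magnitude maps to +/-127.
w = np.array([-1.0, 0.0, 0.25, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
max_err = float(np.max(np.abs(dequantize(q, scale) - w)))
print(q, scale, max_err)
```

The round-trip error of each weight is bounded by half the scale, which is why a pruned model with a narrower weight range per tensor tends to lose less precision under this kind of scheme.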

Paper Structure (25 sections, 14 equations, 3 figures, 2 tables, 1 algorithm)

INTRODUCTION
RELATED WORK AND THEORETICAL FOUNDATIONS
Historical Context and Detailed Critique of Pruning Techniques
The Fisher Information Matrix (FIM) as a Global Saliency Metric
Quantization Strategies and the Pruning-Quantization Conflict
PROPOSED HYBRID Q&P FRAMEWORK: ALGORITHMIC AND MATHEMATICAL RIGOR
Formal Derivations of Sensitivity and Control
Detailed Conditional Iterative Pruning Algorithm
Formal Computational Complexity Analysis
EXPERIMENTAL SETUP AND METHODOLOGY
Heterogeneous Edge Hardware Architecture and Runtime Environments
Implementation Specifics of the HQP Framework
Model Architectures and Dataset Validation Protocols
RESULTS AND DISCUSSION: COMPREHENSIVE QUANTITATIVE AND QUALITATIVE ANALYSIS
Primary Performance Evaluation on MobileNetV3 (Jetson Xavier NX)
...and 10 more sections

Figures (3)

Figure 1: Proposed Hybrid Quantization and Pruning (HQP) Framework Architecture.
Figure 2: Performance Comparison of Optimization Methods on MobileNetV3 (Latency and Accuracy)
Figure 3: Model Size Reduction vs. Accuracy Drop Across Optimization Methods
