Adaptive quantization with mixed-precision based on low-cost proxy
Junzhe Chen, Qiao Yang, Senmao Tian, Shunli Zhang
TL;DR
This work addresses the deployment of deep networks on resource-constrained hardware by proposing LCPAQ, a hardware-aware adaptive mixed-precision quantization framework. The approach combines a hardware-aware integer linear programming (ILP) module, Hessian-trace-based layer sensitivity, Pareto-frontier refinement, and a low-cost proxy neural architecture search (NAS) that efficiently explores quantization hyperparameters, with distillation to mitigate quantization loss. Key contributions include (i) Hessian-trace-based sensitivity for per-layer bit-width decisions, (ii) an ILP optimization under model-size, bit-operation (BOPs), and latency constraints, (iii) a low-cost proxy NAS that accelerates hyperparameter exploration, and (iv) ImageNet accuracy comparable or superior to existing mixed-precision methods at roughly 1/200 of their search time. The method makes mixed-precision quantized models practical to deploy on devices with limited compute and memory, delivering competitive accuracy with far lower search overhead.
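To make the Hessian-trace sensitivity idea concrete, here is a minimal sketch of Hutchinson's randomized trace estimator, a common way to approximate a Hessian trace using only Hessian-vector products. The summary does not say which estimator LCPAQ uses, so the choice of Hutchinson's method, the function names, and the toy 3x3 matrix standing in for a layer's Hessian are all assumptions made for illustration.

```python
import random


def hessian_vector_product(hessian, v):
    """Multiply a dense symmetric matrix (given as a list of rows) by a vector."""
    return [sum(h_ij * v_j for h_ij, v_j in zip(row, v)) for row in hessian]


def hutchinson_trace(hvp, dim, num_samples=200, seed=0):
    """Estimate trace(H) as the mean of v^T (H v) over random +/-1 vectors v.

    `hvp` is any callable returning H @ v; in a real network it would be
    computed by automatic differentiation rather than from an explicit matrix.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        # Rademacher probe vector: each entry is +1 or -1 with equal probability.
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        hv = hvp(v)
        total += sum(v_i * hv_i for v_i, hv_i in zip(v, hv))
    return total / num_samples


# Hypothetical toy "Hessian" for one layer (illustrative values only).
toy_hessian = [
    [4.0, 1.0, 0.0],
    [1.0, 3.0, 0.5],
    [0.0, 0.5, 2.0],
]
estimate = hutchinson_trace(
    lambda v: hessian_vector_product(toy_hessian, v), dim=3
)
exact = sum(toy_hessian[i][i] for i in range(3))
print(f"estimated trace: {estimate:.3f}, exact trace: {exact:.3f}")
```

A layer with a larger estimated trace would be treated as more sensitive to quantization and therefore a candidate for a higher bit-width.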
Abstract
Deploying complex neural network models on hardware with limited resources is a critical problem. This paper proposes a novel model quantization method, Low-Cost Proxy-Based Adaptive Mixed-Precision Model Quantization (LCPAQ), which contains three key modules. The hardware-aware module is designed around the limitations of the target hardware, while the adaptive mixed-precision quantization module evaluates quantization sensitivity using the Hessian matrix and Pareto-frontier techniques. Integer linear programming is then used to fine-tune the quantization across different layers. Finally, the low-cost proxy neural architecture search module efficiently explores the ideal quantization hyperparameters. Experiments on ImageNet demonstrate that LCPAQ achieves quantization accuracy comparable or superior to that of existing mixed-precision models. Notably, LCPAQ requires only about 1/200 of the search time of existing methods, which offers a shortcut to practical quantization on resource-limited devices.
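The per-layer bit-width selection described above can be sketched as a small constrained optimization. The following is an illustrative toy, not the authors' formulation: it enforces only a model-size constraint (the paper also mentions BOPs and latency), it uses exhaustive enumeration as a stand-in for an ILP solver, and the error model (sensitivity multiplied by 4^-b), the function name `allocate_bits`, and all numeric values are assumptions.

```python
from itertools import product


def allocate_bits(traces, param_counts, size_budget_bits, choices=(2, 4, 8)):
    """Pick one bit-width per layer minimizing total weighted error under a size budget.

    Exhaustive search replaces an ILP solver here, so this is only practical
    for a handful of layers. Returns (assignment, cost), or (None, inf) if no
    assignment fits the budget.
    """
    best_cost, best_assignment = float("inf"), None
    for bits in product(choices, repeat=len(traces)):
        size = sum(n * b for n, b in zip(param_counts, bits))
        if size > size_budget_bits:
            continue  # violates the model-size constraint
        # Assumed toy error model: quantization noise power shrinks as 4^-b,
        # weighted by each layer's Hessian-trace sensitivity.
        cost = sum(t * 4.0 ** (-b) for t, b in zip(traces, bits))
        if cost < best_cost:
            best_cost, best_assignment = cost, bits
    return best_assignment, best_cost


# Hypothetical three-layer network: a small sensitive layer, a large robust one.
traces = [50.0, 5.0, 0.5]           # assumed Hessian-trace sensitivities
param_counts = [1000, 4000, 16000]  # assumed parameter counts per layer
budget = 4 * sum(param_counts)      # size budget equal to a uniform 4-bit model

assignment, cost = allocate_bits(traces, param_counts, budget)
print("bit-widths per layer:", assignment)
```

In this toy setting the budget of a uniform 4-bit model is spent unevenly: the small, sensitive layers receive more bits and the large, robust layer receives fewer, which is the behaviour a sensitivity-guided mixed-precision search aims for.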
