Stabilizing Quantization-Aware Training by Implicit-Regularization on Hessian Matrix
Junbiao Pang, Tianyang Cai
TL;DR
This work addresses the instability of Quantization-Aware Training (QAT), which arises because unavoidable quantization errors induce sharp loss landscapes. It proposes Feature-Perturbed Quantization (FPQ), which injects stochastic perturbations into layer inputs (Stochastic Feature Perturbations) and uses Channel-wise Standardization Distillation to align feature distributions between the quantized and full-precision models, thereby implicitly regularizing the Hessian and promoting flat minima. Theoretical and empirical analyses show that FPQ reduces the Hessian norm (e.g., $\operatorname{Tr}(\nabla_{\boldsymbol{w}}^2 L)$) and outperforms state-of-the-art QAT methods and FP baselines on CIFAR-10 and CIFAR-100 across multiple architectures, with ablations confirming that perturbations and distillation contribute additively. The results suggest that FPQ is a practical, architecture-agnostic way to train stable, high-accuracy quantized models for deployment on edge devices.
Abstract
Quantization-Aware Training (QAT) is one of the prevailing neural network compression solutions. However, its stability is challenged by the unavoidable quantization error, which can degrade performance. We find that a sharp loss landscape, which leads to a dramatic performance drop, is an essential cause of this instability. Theoretically, we show that perturbing the features leads to a flat local minimum. However, simply adding perturbations to either the weights or the features empirically degrades the performance of the Full Precision (FP) model. In this paper, we propose Feature-Perturbed Quantization (FPQ), which stochastically perturbs the features and applies feature distillation to the quantized model. Our method generalizes well to different network architectures and various QAT methods. Furthermore, we show mathematically that FPQ implicitly regularizes the Hessian norm, which calibrates the smoothness of the loss landscape. Extensive experiments demonstrate that our approach significantly outperforms current State-Of-The-Art (SOTA) QAT methods and even the FP counterparts.
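The two ingredients of FPQ, stochastic feature perturbation and feature distillation on standardized channels, can be illustrated with a minimal NumPy sketch. The sketch is not the authors' implementation. The function names, the uniform zero-mean noise model, the `scale` parameter, the `(N, C, H, W)` layout, and the mean-squared-error form of the loss are all assumptions made for illustration.

```python
import numpy as np


def perturb_features(x, scale=0.1, rng=None):
    """Add zero-mean uniform noise to a feature map.

    Assumption: noise is drawn from U(-scale/2, scale/2); the source
    does not specify the noise distribution in this section.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.uniform(-0.5, 0.5, size=x.shape) * scale
    return x + noise


def channel_standardize(f, eps=1e-5):
    """Standardize each channel of an (N, C, H, W) feature map.

    Each channel is shifted to zero mean and scaled to (roughly) unit
    standard deviation, computed over the batch and spatial axes.
    """
    mean = f.mean(axis=(0, 2, 3), keepdims=True)
    std = f.std(axis=(0, 2, 3), keepdims=True)
    return (f - mean) / (std + eps)


def csd_loss(f_quant, f_fp):
    """Hypothetical distillation loss: MSE between standardized features
    of the quantized model (student) and the FP model (teacher)."""
    diff = channel_standardize(f_quant) - channel_standardize(f_fp)
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
fp_feat = rng.normal(size=(4, 3, 8, 8))             # stand-in FP features
q_feat = perturb_features(fp_feat, 0.1, rng)        # stand-in perturbed features
loss = csd_loss(q_feat, fp_feat)
```

Because each channel is standardized before comparison, this loss ignores per-channel differences in scale and offset between the two models and penalizes only differences in the shape of the feature distributions.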
