FP=xINT: Representing Neural Networks via Low-Bit Series Basis Functions
Boyang Zhang, Daning Cheng, Yunquan Zhang, Jiake Tian, Jing Li, Fangming Liu
TL;DR
Post-training quantization methods struggle to preserve accuracy at very low bit-widths. The authors propose a deep model series expansion that represents an FP model as a sum of low-bit basis models across tensor, layer, and model levels, with AbelianAdd/Mul enabling parallel, commutative operations. They prove convergence and demonstrate that the expansion can reach or exceed FP accuracy on ResNet-50 at 4-bit (77.03%), and show competitive results on NLP/LLMs without calibration or fine-tuning. The approach yields high parallelism and speed, and can be integrated with existing quantization techniques as basis functions. The work suggests a path toward hardware-friendly, INT-based inference with minimal accuracy loss.
Abstract
Post-Training Quantization (PTQ) converts pre-trained Full-Precision (FP) models into quantized versions without training. While existing methods reduce model size and computational cost, quantization noise causes them to lose significant accuracy and quantization efficiency at extremely low bit-widths. We introduce a deep model series expansion framework to address this issue, enabling rapid and accurate approximation of unquantized models without calibration sets or fine-tuning. This is the first use of series expansion for neural network quantization. Specifically, our method expands the FP model into multiple low-bit basis models. To ensure accurate quantization, we develop low-bit basis model expansions at different granularities (tensor, layer, model) and prove that they converge to the dense model, thus restoring FP model accuracy. Additionally, we design AbelianAdd/Mul operations between isomorphic models in the low-bit expansion; these form an Abelian group, which ensures that the operations are commutative and can run in parallel. Experiments show that our algorithm achieves state-of-the-art performance in low-bit settings; for example, 4-bit quantization of ResNet-50 surpasses the original accuracy, reaching 77.03%. The code will be made public.
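To make the tensor-level idea concrete, the following is a minimal sketch of a low-bit series expansion, not the authors' implementation. The abstract does not specify how the basis terms are built, so this sketch assumes a simple stand-in: each term is a symmetric uniform 4-bit quantization of the residual left by the previous terms. The names `quantize_symmetric`, `series_expand`, and `reconstruct` are hypothetical and chosen for illustration only.

```python
import numpy as np

def quantize_symmetric(x, bits=4):
    """Symmetric uniform quantization; returns (integer codes, FP scale)."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit
    max_abs = float(np.max(np.abs(x)))
    if max_abs == 0.0:                      # nothing left to quantize
        return np.zeros(x.shape, dtype=np.int8), 1.0
    scale = max_abs / qmax
    codes = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return codes, scale

def series_expand(w, bits=4, terms=3):
    """Assumed expansion: w ~= sum_i scale_i * codes_i, each term
    obtained by quantizing the residual of the previous terms."""
    residual = w.astype(np.float64)
    basis = []
    for _ in range(terms):
        codes, scale = quantize_symmetric(residual, bits)
        basis.append((codes, scale))
        residual = residual - scale * codes
    return basis

def reconstruct(basis):
    """Sum the low-bit basis tensors back into an FP tensor."""
    return sum(scale * codes.astype(np.float64) for codes, scale in basis)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))               # stand-in for an FP weight tensor
basis = series_expand(w, bits=4, terms=3)

# Max absolute error after keeping 1, 2, and 3 terms of the series.
errors = [float(np.max(np.abs(w - reconstruct(basis[:k])))) for k in range(1, 4)]
print(errors)
```

Under this assumed scheme, rounding leaves a residual of at most half a quantization step, so each additional 4-bit term shrinks the worst-case error by a factor of at least 14. The terms are combined by plain addition, so they can be summed in any order and evaluated independently, which is the kind of commutativity and parallelism the abstract attributes to its AbelianAdd/Mul operations.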
