FP=xINT: Representing Neural Networks via Low-Bit Series Basis Functions

Boyang Zhang, Daning Cheng, Yunquan Zhang, Jiake Tian, Jing Li, Fangming Liu

TL;DR

Post-training quantization methods struggle to preserve accuracy at very low bit-widths. The authors propose a deep model series expansion that represents an FP model as a sum of low-bit basis models across tensor, layer, and model levels, with AbelianAdd/Mul enabling parallel, commutative operations. They prove convergence and demonstrate that the expansion can reach or exceed FP accuracy on ResNet-50 at 4-bit (77.03%), and show competitive results on NLP/LLMs without calibration or fine-tuning. The approach yields high parallelism and speed, and can be integrated with existing quantization techniques as basis functions. The work suggests a path toward hardware-friendly, INT-based inference with minimal accuracy loss.

Abstract

Post-Training Quantization (PTQ) converts pre-trained Full-Precision (FP) models into quantized versions without retraining. While existing methods reduce model size and computational cost, quantization noise causes them to lose significant accuracy and quantization efficiency at extremely low bit-widths. We introduce a deep model series expansion framework to address this issue, enabling rapid and accurate approximation of unquantized models without calibration sets or fine-tuning. This is the first use of series expansion for neural network quantization. Specifically, our method expands the FP model into multiple low-bit basis models. To ensure accurate quantization, we develop low-bit basis model expansions at different granularities (tensor, layer, model), and theoretically confirm their convergence to the dense model, thus restoring FP model accuracy. Additionally, we design AbelianAdd/Mul operations between isomorphic models in the low-bit expansion, forming an Abelian group to ensure operation parallelism and commutativity. The experiments show that our algorithm achieves state-of-the-art performance in low-bit settings; for example, 4-bit quantization of ResNet-50 surpasses the original accuracy, reaching 77.03%. The code will be made public.

Paper Structure

This paper contains 14 sections, 2 theorems, 8 equations, 4 figures, 6 tables.

Key Result

Theorem 1

$M=M_{sa}+bias \cdot M_{nsy}+\sum_{i=1}^{n}scale_i \cdot \widetilde{M}_{i}$, where $M_{sa}$ is a sparse float tensor produced by saturation quantization, $M_{nsy}$ is a tensor whose elements are all 1, $\widetilde{M}_{i}$ is a tensor whose elements are all of the INT(X) data type, and $scale_i = 2^X \cdot scale_{\ldots}$
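
To make the shape of this expansion concrete, here is a minimal NumPy sketch of one way to build such a series by repeatedly quantizing the residual. It is an illustration under stated assumptions, not the authors' algorithm: the names `expand_tensor` and `reconstruct` are hypothetical, saturation is skipped (so $M_{sa}=0$), the tensor mean stands in for $bias$, and each scale is taken from the max-abs of the current residual because the definition of $scale_i$ is cut off above.

```python
import numpy as np

def expand_tensor(m, bits=4, terms=3):
    """Greedy residual expansion: m ~= bias + sum_i scales[i] * ints[i].

    Assumptions of this sketch (not taken from the paper):
    no saturation term, bias = mean(m), scale = max|residual| / qmax.
    """
    qmax = 2 ** (bits - 1) - 1            # symmetric signed range, [-7, 7] for 4 bits
    bias = float(m.mean())                # scalar multiplying the all-ones tensor
    residual = m - bias
    scales, ints = [], []
    for _ in range(terms):
        max_abs = float(np.abs(residual).max())
        if max_abs == 0.0:                # nothing left to approximate
            break
        scale = max_abs / qmax
        q = np.clip(np.round(residual / scale), -qmax, qmax).astype(np.int8)
        scales.append(scale)
        ints.append(q)
        residual = residual - scale * q   # the next term quantizes what is left over
    return bias, scales, ints

def reconstruct(bias, scales, ints, shape):
    """Sum the basis tensors back into a float tensor."""
    out = np.full(shape, bias, dtype=np.float64)
    for s, q in zip(scales, ints):
        out += s * q
    return out

rng = np.random.default_rng(0)
m = rng.standard_normal((64, 64))

errors = []
for n in (1, 2, 3):
    bias, scales, ints = expand_tensor(m, bits=4, terms=n)
    approx = reconstruct(bias, scales, ints, m.shape)
    errors.append(float(np.abs(m - approx).max()))

print(errors)  # max-abs reconstruction error for n = 1, 2, 3 terms
```

In this sketch the rounding error of each term is at most half its scale, so with 4-bit terms the worst-case residual shrinks by a factor of at least 14 per added term. That is the kind of fast convergence the theorem is concerned with, though the paper's own construction and proof may differ.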

Figures (4)

  • Figure 1: The general form of the series expansion of a function $f(x)$. Usually, each $h_i(x)$ is a computation-friendly function, and addition is a parallel-friendly operation. The expansion should converge quickly, meaning that $f(x)-\sum_{i=1}^n h_i(x)$ is small enough even when $n$ is not large.
  • Figure 2: The expansion of tensor multiplication, where $A$ and $W$ are $n \times n$ tensors. The black point grid is produced by saturation quantization, and the blue point grid by asymmetric quantization. The black and blue grids are optional; the red points are required by all quantization methods. In practice, the black point grids have little influence on model performance. The computational complexity of the blue grid is small, $O_{INTX}(n^2)$.
  • Figure 3: Our series expansion at different levels and specific operation details. In the end, the FP model is expanded into a sum of multiple INT models.
  • Figure 4: The left sub-figure (a) shows our experiments on saturation and asymmetric quantization. The right sub-figure (b) shows how loss and accuracy change as the number of expansion terms increases.
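
The tensor-multiplication expansion described for Figure 2 can be sketched as follows. This is a hedged illustration that keeps only the required INT-by-INT terms (the red points) and omits the optional saturation and asymmetric grids. The helper names `expand_symmetric` and `expanded_matmul`, and the max-abs scale rule, are assumptions of this sketch rather than anything specified in the source.

```python
import numpy as np

def expand_symmetric(x, bits=4, terms=3):
    """Return [(scale, int_tensor), ...] with x ~= sum(scale * int_tensor)."""
    qmax = 2 ** (bits - 1) - 1
    residual = np.asarray(x, dtype=np.float64).copy()
    parts = []
    for _ in range(terms):
        max_abs = float(np.abs(residual).max())
        if max_abs == 0.0:
            break
        scale = max_abs / qmax
        q = np.clip(np.round(residual / scale), -qmax, qmax).astype(np.int32)
        parts.append((scale, q))
        residual -= scale * q
    return parts

def expanded_matmul(a_parts, w_parts):
    """A @ W ~= sum_{i,j} (sa_i * sw_j) * (A_i @ W_j).

    Every A_i @ W_j is an integer matmul; the terms are independent,
    so they could be computed in parallel and summed in any order.
    """
    total = 0.0
    for sa, qa in a_parts:
        for sw, qw in w_parts:
            total = total + (sa * sw) * (qa @ qw)
    return total

rng = np.random.default_rng(0)
a = rng.standard_normal((16, 16))
w = rng.standard_normal((16, 16))
exact = a @ w

err = {}
for n in (1, 2, 3):
    approx = expanded_matmul(expand_symmetric(a, terms=n),
                             expand_symmetric(w, terms=n))
    err[n] = float(np.abs(exact - approx).max())

print(err)  # max-abs error of the expanded product for 1, 2, 3 terms per operand
```

Because the partial products are only ever added together, the result does not depend on the order in which they are accumulated (up to floating-point rounding). That is the property the summary attributes to the AbelianAdd/Mul design. A trade-off to keep in mind is that $n$ terms per operand produce $n^2$ integer matmuls.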

Theorems & Definitions (6)

  • Theorem 1
  • proof
  • Definition 1
  • Definition 2
  • Theorem 2
  • proof