Frame Quantization of Neural Networks
Wojciech Czaja, Sanghoon Na
TL;DR
This work introduces a data-free post-training quantization method for neural networks based on first-order Sigma-Delta quantization applied to finite unit-norm tight frames (FUNTFs). It provides rigorous error estimates for both feed-forward and residual architectures and demonstrates that increasing frame redundancy or reducing the quantization step improves accuracy, including effective 1-bit quantization with storage benefits. The approach quantizes weight matrices by representing columns (or rows) in frame coordinates, storing compact coefficient data, and using the frame dual to reconstruct quantized weights, with numerical validation on MNIST showing near-original performance. Overall, the method delivers provable error control, practical storage savings, and applicability across common NN topologies, motivating further exploration of frame types and higher-order Sigma-Delta schemes.
Abstract
We present a post-training quantization algorithm with error estimates relying on ideas originating from frame theory. Specifically, we use first-order Sigma-Delta ($\Sigma\Delta$) quantization for finite unit-norm tight frames to quantize weight matrices and biases in a neural network. In our scenario, we derive an error bound between the original neural network and the quantized neural network in terms of step size and the number of frame elements. We also demonstrate how to leverage the redundancy of frames to achieve a quantized neural network with higher accuracy.
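To make the mechanism concrete, the following is a minimal sketch of first-order Sigma-Delta quantization of a weight matrix in frame coordinates. It is not the authors' implementation. The choice of a harmonic frame (one standard family of finite unit-norm tight frames), the unbounded uniform quantizer with step `delta`, and all function names (`harmonic_frame`, `sigma_delta`, `quantize_vector`, `quantize_columns`) are assumptions made for illustration. Each column is expanded in the frame, its coefficients are quantized sequentially while the running error is carried forward, and the column is reconstructed with the canonical dual, which for a tight frame of N unit vectors in dimension d is the frame scaled by d/N.

```python
import numpy as np


def harmonic_frame(N, d):
    """Return an (N, d) array whose rows form a harmonic tight frame in R^d.

    Assumes d is even and N > d. Rows have unit norm and F.T @ F = (N/d) * I.
    """
    if d % 2 != 0 or N <= d:
        raise ValueError("this sketch requires even d and N > d")
    n = np.arange(N)[:, None]               # frame index 0..N-1
    k = np.arange(1, d // 2 + 1)[None, :]   # frequencies 1..d/2
    angles = 2.0 * np.pi * n * k / N
    F = np.empty((N, d))
    F[:, 0::2] = np.cos(angles)
    F[:, 1::2] = np.sin(angles)
    return np.sqrt(2.0 / d) * F


def sigma_delta(coeffs, delta):
    """First-order Sigma-Delta: quantize coefficients to multiples of delta.

    The state u accumulates the quantization error and is fed into the next
    step; with this unbounded quantizer |u| never exceeds delta / 2.
    """
    u = 0.0
    q = np.empty_like(coeffs)
    for i, c in enumerate(coeffs):
        q[i] = delta * np.round((u + c) / delta)
        u = u + c - q[i]
    return q


def quantize_vector(x, F, delta):
    """Quantize x in frame coordinates and reconstruct with the dual frame."""
    N, d = F.shape
    q = sigma_delta(F @ x, delta)    # quantized frame coefficients
    return (d / N) * (F.T @ q)       # canonical dual of a tight frame


def quantize_columns(W, F, delta):
    """Apply quantize_vector to every column of the weight matrix W."""
    cols = [quantize_vector(W[:, j], F, delta) for j in range(W.shape[1])]
    return np.column_stack(cols)


# Demo on a hypothetical 4 x 3 weight matrix: more frame elements (larger N)
# at a fixed step size should give a smaller reconstruction error.
rng = np.random.default_rng(0)
W = rng.uniform(-1.0, 1.0, size=(4, 3))
for N in (8, 64, 512):
    F = harmonic_frame(N, 4)
    err = np.linalg.norm(W - quantize_columns(W, F, delta=0.5))
    print(f"N={N:4d}  error={err:.4f}")
```

In this sketch the reconstruction error of a single vector is bounded by (d/N)(delta/2)(sigma + 1), where sigma is the sum of the distances between consecutive frame vectors; this follows from summation by parts on the state sequence. The bound shrinks when N grows or delta decreases, which is the trade-off between redundancy, step size, and accuracy that the abstract describes.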
