Frame Quantization of Neural Networks

Wojciech Czaja; Sanghoon Na

TL;DR

This work introduces a data-free post-training quantization method for neural networks based on first-order Sigma-Delta quantization applied to finite unit-norm tight frames (FUNTFs). It provides rigorous error estimates for both feed-forward and residual architectures and demonstrates that increasing frame redundancy or reducing the quantization step improves accuracy, including effective 1-bit quantization with storage benefits. The approach quantizes weight matrices by representing columns (or rows) in frame coordinates, storing compact coefficient data, and using the frame dual to reconstruct quantized weights, with numerical validation on MNIST showing near-original performance. Overall, the method delivers provable error control, practical storage savings, and applicability across common NN topologies, motivating further exploration of frame types and higher-order Sigma-Delta schemes.
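The coefficient-quantize-and-reconstruct step described above can be sketched for a single vector, for example one column of a weight matrix. This is a minimal illustration and not the authors' implementation: the choice of the roots-of-unity frame for $\mathbb{R}^2$, the mid-rise alphabet, and all function names (`roots_of_unity_frame`, `midrise_quantize`, `sigma_delta`, `reconstruct`, `quantization_error`) are assumptions made for the example.

```python
import math


def roots_of_unity_frame(N):
    """N-th roots of unity: a finite unit-norm tight frame for R^2 (N >= 3)."""
    return [(math.cos(2 * math.pi * n / N), math.sin(2 * math.pi * n / N))
            for n in range(N)]


def midrise_quantize(v, delta, K):
    """Round v to the nearest level in {(k + 1/2) * delta : k = -K, ..., K - 1}."""
    top = (K - 0.5) * delta
    q = (math.floor(v / delta) + 0.5) * delta
    return max(-top, min(top, q))  # clip to the 2K-level alphabet


def sigma_delta(x, frame, delta, K):
    """First-order Sigma-Delta: quantize frame coefficients in order,
    feeding the accumulated quantization error into the next step."""
    u = 0.0  # internal state, u_0 = 0
    qs, states = [], []
    for e in frame:
        coeff = sum(xi * ei for xi, ei in zip(x, e))  # <x, e_n>
        q = midrise_quantize(u + coeff, delta, K)
        u = u + coeff - q
        qs.append(q)
        states.append(u)
    return qs, states


def reconstruct(qs, frame):
    """Linear reconstruction; for a unit-norm tight frame S^{-1} e_n = (d / N) e_n."""
    d, N = len(frame[0]), len(frame)
    return [d / N * sum(q * e[i] for q, e in zip(qs, frame)) for i in range(d)]


def quantization_error(x, N, delta, K):
    """Return (||x - x_hat||, max_n |u_n|) for the roots-of-unity frame of size N."""
    frame = roots_of_unity_frame(N)
    qs, states = sigma_delta(x, frame, delta, K)
    x_hat = reconstruct(qs, frame)
    err = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))
    return err, max(abs(u) for u in states)


x, delta, K = (0.3, -0.2), 0.25, 2  # chosen so that ||x|| <= (K - 1/2) * delta
for N in (8, 32, 128):
    err, max_state = quantization_error(x, N, delta, K)
    print(f"N={N:4d}  error={err:.5f}  max|u_n|={max_state:.5f}")
```

Only the quantized coefficients `qs` (one of $2K$ levels each) need to be stored; the frame itself is fixed. In this sketch the internal state stays within $\delta/2$ and the reconstruction error is bounded by a constant multiple of $\delta/N$, which illustrates why a larger frame or a smaller step size helps. Setting `K = 1` gives a two-level (1-bit) alphabet, provided the input satisfies $\|x\| \le \delta/2$.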

Abstract

We present a post-training quantization algorithm with error estimates relying on ideas originating from frame theory. Specifically, we use first-order Sigma-Delta ($\Sigma\Delta$) quantization for finite unit-norm tight frames to quantize weight matrices and biases in a neural network. In our scenario, we derive an error bound between the original neural network and the quantized neural network in terms of step size and the number of frame elements. We also demonstrate how to leverage the redundancy of frames to achieve a quantized neural network with higher accuracy.

Paper Structure

This paper contains 12 sections, 9 theorems, 44 equations, 2 figures, 4 tables, and 1 algorithm.

Introduction
Preliminaries
Frame quantization
Frame Quantization for Neural Networks
Error Estimates
Feedforward Networks
Neural Networks with Residual Blocks
Numerical Results
Feedforward Network with 3 Layers
Network with 2 Residual Blocks
1-Bit Quantization
Conclusions and Future Work

Key Result

Theorem 3.1

[benedetto2006sigma] Let $F=\{e_1,\cdots,e_N\}$ be a finite unit-norm tight frame for $\mathbb{R}^{d}$, and let $p$ be a permutation of $\{1,2,\cdots,N\}$. Let $x\in\mathbb{R}^d$ satisfy $\|x\| \le (K-1/2)\delta$ and have the frame expansion $x=\sum_{i=1}^{N}\langle x,e_i \rangle S^{-1} e_i,$ where $S
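
For orientation, the first-order $\Sigma\Delta$ scheme in the setting of Theorem 3.1 can be sketched as it is commonly defined in the finite-frame literature. The symbols $x_n$, $u_n$, $q_n$, $Q$, and $\sigma(F,p)$ are notation assumed here rather than taken from the text above, and the final estimate is the standard summation-by-parts bound, which may differ in form from the theorem's exact statement.

$$
\begin{aligned}
&x_n = \langle x, e_n\rangle, \qquad u_0 = 0, \qquad
q_n = Q\big(u_{n-1} + x_{p(n)}\big), \qquad
u_n = u_{n-1} + x_{p(n)} - q_n, \\
&\tilde{x} = \frac{d}{N}\sum_{n=1}^{N} q_n\, e_{p(n)}
\qquad \text{(for a unit-norm tight frame, } S^{-1} = \tfrac{d}{N} I\text{)}, \\
&x - \tilde{x}
= \frac{d}{N}\sum_{n=1}^{N} (u_n - u_{n-1})\, e_{p(n)}
= \frac{d}{N}\Big(\sum_{n=1}^{N-1} u_n \big(e_{p(n)} - e_{p(n+1)}\big) + u_N\, e_{p(N)}\Big), \\
&\|x - \tilde{x}\| \le \frac{d}{N}\,\frac{\delta}{2}\,\big(\sigma(F,p) + 1\big),
\qquad \sigma(F,p) = \sum_{n=1}^{N-1} \big\|e_{p(n)} - e_{p(n+1)}\big\|.
\end{aligned}
$$

Here $Q$ rounds to the nearest element of the $2K$-level alphabet with spacing $\delta$, and the last line uses $|u_n| \le \delta/2$, which holds under the condition $\|x\| \le (K-1/2)\delta$ because then $|u_{n-1} + x_{p(n)}| \le K\delta$ at every step.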

Figures (2)

Figure 1: Worst-case error $\|f(X)-f_Q(X)\|$ for FNN with 3 layers.
Figure 2: $\log \mathrm{E}_{X}[\|f(X)-f_Q(X)\|\times N/\delta]$ for FNN with 3 layers.

Theorems & Definitions (17)

Definition 3.1
Theorem 3.1
Theorem 3.2
Corollary 3.3
proof
Lemma 5.1
proof
Lemma 5.2
proof
Theorem 5.3
...and 7 more
