Frame Quantization of Neural Networks

Wojciech Czaja, Sanghoon Na

TL;DR

This work introduces a data-free post-training quantization method for neural networks based on first-order Sigma-Delta quantization applied to finite unit-norm tight frames (FUNTFs). It provides rigorous error estimates for both feed-forward and residual architectures and demonstrates that increasing frame redundancy or reducing the quantization step improves accuracy, including effective 1-bit quantization with storage benefits. The approach quantizes weight matrices by representing columns (or rows) in frame coordinates, storing compact coefficient data, and using the frame dual to reconstruct quantized weights, with numerical validation on MNIST showing near-original performance. Overall, the method delivers provable error control, practical storage savings, and applicability across common NN topologies, motivating further exploration of frame types and higher-order Sigma-Delta schemes.

Abstract

We present a post-training quantization algorithm with error estimates relying on ideas originating from frame theory. Specifically, we use first-order Sigma-Delta ($\Sigma\Delta$) quantization for finite unit-norm tight frames to quantize weight matrices and biases in a neural network. In our scenario, we derive an error bound between the original neural network and the quantized neural network in terms of step size and the number of frame elements. We also demonstrate how to leverage the redundancy of frames to achieve a quantized neural network with higher accuracy.
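
To make the pipeline concrete, here is a minimal NumPy sketch; it is not the authors' implementation. It assumes a harmonic frame for $\mathbb{R}^d$ (with $d$ even and $N>d$) as the finite unit-norm tight frame, the identity ordering of the frame elements, a zero initial state, and a $2K$-level midrise alphabet with step $\delta$. The function names `harmonic_frame`, `sigma_delta`, and `quantize_matrix` are hypothetical. Each column of a weight matrix is expanded in frame coefficients, quantized with first-order $\Sigma\Delta$, and reconstructed with the canonical dual frame, which for a unit-norm tight frame is $(d/N)e_n$.

```python
import numpy as np


def harmonic_frame(d, N):
    """Return an (N, d) array whose rows form a harmonic frame for R^d.

    Assumes d is even and N > d, which makes the frame unit-norm and tight.
    """
    n = np.arange(N)[:, None]                # frame index, shape (N, 1)
    k = np.arange(1, d // 2 + 1)[None, :]    # frequencies 1..d/2, shape (1, d/2)
    angles = 2.0 * np.pi * n * k / N
    F = np.empty((N, d))
    F[:, 0::2] = np.cos(angles)
    F[:, 1::2] = np.sin(angles)
    return np.sqrt(2.0 / d) * F


def sigma_delta(coeffs, delta, K):
    """First-order Sigma-Delta over a 2K-level midrise alphabet with step delta."""
    q = np.empty_like(coeffs)
    u = 0.0                                  # internal state, u_0 = 0 (assumption)
    for i, c in enumerate(coeffs):
        v = u + c
        # Nearest midrise level (level + 1/2) * delta, clipped to the alphabet.
        level = np.clip(np.floor(v / delta), -K, K - 1)
        q[i] = (level + 0.5) * delta
        u = v - q[i]                         # carry the quantization error forward
    return q


def quantize_matrix(W, N, delta, K):
    """Quantize each column of W (shape (d, m)) in frame coordinates."""
    d, m = W.shape
    F = harmonic_frame(d, N)
    C = F @ W                                # frame coefficients <w_j, e_n>, shape (N, m)
    Q = np.column_stack([sigma_delta(C[:, j], delta, K) for j in range(m)])
    W_q = (d / N) * F.T @ Q                  # reconstruct with the dual frame (d/N) e_n
    return W_q, Q


# Toy example: 1-bit alphabet {-1, +1} (K = 1, delta = 2).
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
W *= 0.9 / np.linalg.norm(W, axis=0).max()   # keep column norms below (K - 1/2) * delta

for N in (8, 32, 128, 512):
    W_q, Q = quantize_matrix(W, N, delta=2.0, K=1)
    err = np.linalg.norm(W - W_q, axis=0).max()
    print(f"N={N:4d}  max column error = {err:.4f}")
```

With $K=1$ and $\delta=2$, every stored coefficient in this sketch is $\pm 1$, so a column costs $N$ bits. Running the loop for growing $N$ shows how added redundancy affects the reconstruction error.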

Paper Structure

This paper contains 12 sections, 9 theorems, 44 equations, 2 figures, 4 tables, 1 algorithm.

Key Result

Theorem 3.1

[benedetto2006sigma] Let $F=\{e_1,\ldots,e_N\}$ be a finite unit-norm tight frame for $\mathbb{R}^{d}$, and let $p$ be a permutation of $\{1,2,\ldots,N\}$. Let $x\in\mathbb{R}^d$ satisfy $\|x\| \le (K-1/2)\delta$ and have the frame expansion $x=\sum_{i=1}^{N}\langle x,e_i \rangle S^{-1} e_i,$ where $S

Figures (2)

  • Figure 1: Worst-case error $\|f(X)-f_Q(X)\|$ for FNN with 3 layers.
  • Figure 2: $\log \mathrm{E}_{X}[\|f(X)-f_Q(X)\|\times N/\delta]$ for FNN with 3 layers.

Theorems & Definitions (17)

  • Definition 3.1
  • Theorem 3.1
  • Theorem 3.2
  • Corollary 3.3
  • proof
  • Lemma 5.1
  • proof
  • Lemma 5.2
  • proof
  • Theorem 5.3
  • ...and 7 more