Three Quantization Regimes for ReLU Networks

Weigutian Ou; Philipp Schenkel; Helmut Bölcskei

TL;DR

This work establishes nonasymptotic minimax error bounds for approximating Lipschitz functions on [0,1] by deep ReLU networks with finite-precision weights. It identifies three quantization regimes (under-, proper-, and over-quantization), in which the minimax approximation error exhibits exponential, polynomial, and constant behavior, respectively, as a function of weight precision, and it proves memory-optimality in the proper-quantization regime. The constructive upper bound starts from unquantized approximants and then quantizes them with a refined bit-extraction technique, which yields memory-optimal networks. A depth-precision tradeoff converts high-precision networks into deeper, functionally equivalent low-precision networks while preserving accuracy. Complementary lower bounds based on memory requirements, VC-dimension, and numerical precision make the three-regime characterization tight and guide network design under a fixed memory budget. The results advance the theory of ReLU network approximation under finite precision and are relevant to hardware-aware quantization and depth-width tradeoffs.

Abstract

We establish the fundamental limits in the approximation of Lipschitz functions by deep ReLU neural networks with finite-precision weights. Specifically, three regimes, namely under-, over-, and proper quantization, in terms of minimax approximation error behavior as a function of network weight precision, are identified. This is accomplished by deriving nonasymptotic tight lower and upper bounds on the minimax approximation error. Notably, in the proper-quantization regime, neural networks exhibit memory-optimality in the approximation of Lipschitz functions. Deep networks have an inherent advantage over shallow networks in achieving memory-optimality. We also develop the notion of depth-precision tradeoff, showing that networks with high-precision weights can be converted into functionally equivalent deeper networks with low-precision weights, while preserving memory-optimality. This idea is reminiscent of sigma-delta analog-to-digital conversion, where oversampling rate is traded for resolution in the quantization of signal samples. We improve upon the best-known ReLU network approximation results for Lipschitz functions and describe a refinement of the bit extraction technique which could be of independent general interest.
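The depth-precision tradeoff mentioned in the abstract can be illustrated with a toy sketch. The code below is not the paper's construction; it only shows, under stated assumptions, how a single multiplication by a b-bit weight can be replaced by a chain of b/k ReLU layers whose weights each need only k bits. The assumptions are: the input x is nonnegative (so every ReLU acts as the identity), the weight is w = w_int / 2**b with 0 <= w_int < 2**b, and k divides b. The names `split_weight` and `deep_low_precision_multiply` are hypothetical helpers introduced for this illustration.

```python
from fractions import Fraction


def split_weight(w_int: int, b: int, k: int) -> list[int]:
    """Split w = w_int / 2**b into b/k base-2**k digits, most significant first."""
    assert b % k == 0 and 0 <= w_int < 2**b
    digits = []
    for _ in range(b // k):
        digits.append(w_int % 2**k)  # extract the current least significant digit
        w_int //= 2**k
    return digits[::-1]


def relu(x):
    return max(x, 0)


def deep_low_precision_multiply(x: Fraction, digits: list[int], k: int) -> Fraction:
    """Compute w * x by Horner's scheme, one ReLU layer per digit.

    Each layer uses only the weights c (an integer below 2**k), 1, and 2**-k,
    so no single weight needs more than k bits (assumed toy weight grid).
    """
    scale = Fraction(1, 2**k)
    acc = Fraction(0)
    for c in reversed(digits):  # start from the least significant digit
        acc = relu(scale * (c * x + acc))
    return acc


# One 8-bit weight becomes four layers with 2-bit weights.
digits = split_weight(183, b=8, k=2)
x = Fraction(3, 4)
assert deep_low_precision_multiply(x, digits, k=2) == Fraction(183, 256) * x
```

Exact rational arithmetic is used so that the equality between the shallow high-precision product and the deep low-precision chain can be checked without floating-point error. The total number of stored bits is unchanged in this toy example; only the split between depth and per-weight precision differs.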

Paper Structure (32 sections, 42 theorems, 361 equations, 5 figures)

Introduction
Definition of key concepts and organization of the paper
Minimax Error Lower Bounds
Upper-bounding the cardinality of $\mathcal{R}_b^1 (W,L)$
Lower-bounding the minimax code length $\ell ( \varepsilon, H^1 ( [0,1] ))$
Lower bound incurred by minimum memory requirement
Two additional lower bounds
A Constructive Minimax Error Upper Bound
Depth-Precision Tradeoff
The Three Quantization Regimes
Proof of Proposition \ref{prop:lower_bound_approximation_VC_dimension}
Proof of Theorem \ref{thm:approximation_lip}
Preparation for the Proof of Proposition \ref{prop:approximation_lip_increasing_weights}
Realizing One-Dimensional Bounded Piecewise Linear Functions by ReLU Networks
Bit Extraction
...and 17 more sections

Key Result

Proposition 2.2

Let $(\mathcal{X}, \delta)$ be a metric space, $\mathcal{Y} \subseteq \mathcal{X}$, and $\varepsilon \in \mathbb{R}_+$. Every finite subset $\mathcal{G} \subseteq \mathcal{X}$ such that $\mathcal{A} ( \mathcal{Y} , \mathcal{G}, \delta ) \leq \varepsilon$, induces an encoder-decoder pair $( E: \math

Figures (5)

Figure 1: The basis $\{ \gamma_i\}_{i =0 }^{M - 1}$ for $\Sigma(X, \infty)$.
Figure 2: The function $f^j_{k,\ell}$.
Figure 3: The function $\rho\circ f^j_{k,\ell}$.
Figure 4: The functions $\rho \circ f^j_{k,\ell}, j =1,2,3$.
Figure 5: $\gamma_{kt + \ell} = \rho\circ f_{k, \ell}^1 - \rho\circ f_{k, \ell}^2 + \rho\circ f_{k, \ell}^3$

Theorems & Definitions (98)

Definition 1.1
Definition 1.2: Minimax (approximation) error
Definition 2.1
Proposition 2.2
proof
Definition 2.3: Memory redundancy and memory optimality
Proposition 2.4
proof
Definition 2.5: Covering number and packing number
Lemma 2.6
...and 88 more
