New advances in universal approximation with neural networks of minimal width

Dennis Rochau, Robin Chan, Hanno Gottschalk

TL;DR

This work advances the theory of universal approximation (UA) for narrow neural networks by establishing explicit minimal-width results for $L^p$ approximation and for uniform approximation of continuous functions on compact sets, using leaky ReLU variants, FLOOR, and related activations. It develops a constructive coding scheme that realizes UA at width $w_{\min} = \max\{d_x, d_y\}$ (or $\max\{2, d_x, d_y\}$, depending on the activation) and shows that autoencoders with one-dimensional features are universal in the $L^p$ sense. The paper further extends UA to LU-decomposable invertible networks (LU-Net), proves distributional universal approximation (DUAP) for LU-Net normalizing flows, and shows that smoothed LU networks realize diffeomorphisms that are universal approximators of $L^p$ maps, linking to the theorem of Brenier and Gangbo. A lower bound of $\max\{d_x, d_y\} + 1$ is provided for continuous monotone activations when $d_y \leq 2d_x$, highlighting fundamental width limitations and the necessity of discontinuities or non-monotonicity for minimal-width UA in certain regimes. Collectively, the results deepen understanding of the capacity-width tradeoffs in neural networks and provide constructive tools for normalizing flows and diffeomorphic transport, with implications for both theory and practice.

Abstract

We prove several universal approximation results at minimal or near-minimal width for approximation of $L^p(\mathbb{R}^{d_x}, \mathbb{R}^{d_y})$ and $C^0(\mathbb{R}^{d_x}, \mathbb{R}^{d_y})$ on compact sets. Our approach uses a unified coding scheme that yields explicit constructions relying only on standard analytic tools. We show that feedforward neural networks with two leaky ReLU activations $σ_α$, $σ_{-α}$ achieve the optimal width $\max\{d_x, d_y\}$ for $L^p$ approximation, while a single leaky ReLU $σ_α$ achieves width $\max\{2, d_x, d_y\}$, providing an alternative proof of the results of Cai et al. (2023). By generalizing to stepped leaky ReLU activations, we extend these results to uniform approximation of continuous functions while identifying sets of activation functions compatible with gradient-based training. Since our constructions pass through an intermediate dimension of one, they imply that autoencoders with a one-dimensional feature space are universal approximators. We further show that squashable activations combined with FLOOR achieve width $\max\{3, d_x, d_y\}$ for uniform approximation. We also establish a lower bound of $\max\{d_x, d_y\} + 1$ for networks when all activations are continuous and monotone and $d_y \leq 2d_x$. Moreover, we extend our results to invertible LU-decomposable networks, proving distributional universal approximation for LU-Net normalizing flows and providing a constructive proof of the classical theorem of Brenier and Gangbo on $L^p$ approximation by diffeomorphisms.
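
For quick reference, the widths quoted in the abstract can be tabulated as a function of $d_x$ and $d_y$. The following is a minimal sketch only: the names `leaky_relu` and `stated_widths` and the dictionary keys are illustrative and do not come from the paper, and the convention $σ_α(x) = x$ for $x \geq 0$, $σ_α(x) = αx$ for $x < 0$ is the usual leaky ReLU definition, assumed here.

```python
def leaky_relu(x: float, alpha: float) -> float:
    """Leaky ReLU: identity for x >= 0, slope alpha for x < 0 (assumed convention)."""
    return x if x >= 0 else alpha * x


def stated_widths(d_x: int, d_y: int) -> dict:
    """Collect the widths quoted in the abstract for given input/output dimensions."""
    base = max(d_x, d_y)
    widths = {
        # two leaky ReLUs (sigma_alpha and sigma_{-alpha}), L^p approximation
        "lp_two_leaky_relus": base,
        # a single leaky ReLU sigma_alpha, L^p approximation
        "lp_single_leaky_relu": max(2, base),
        # squashable activations combined with FLOOR, uniform approximation
        "uniform_squashable_plus_floor": max(3, base),
    }
    # The lower bound is stated only for continuous monotone activations
    # and only in the regime d_y <= 2 * d_x.
    if d_y <= 2 * d_x:
        widths["lower_bound_continuous_monotone"] = base + 1
    return widths
```

For example, `stated_widths(1, 1)` reports width 1 for the two-activation $L^p$ result and a lower bound of 2 for continuous monotone activations, which illustrates why a second activation, discontinuities, or non-monotonicity is needed to reach width $\max\{d_x, d_y\}$.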

Paper Structure

This paper contains 20 sections, 47 theorems, 145 equations, 9 figures, 4 tables.

Key Result

Lemma 9

Let $\alpha \in (0,1) \cup (1,\infty)$, $\sigma_{\alpha}$ the corresponding LReLU, $f \in C_{\mathop{\mathrm{mon}}\nolimits}^{0}(\mathbb{R},\mathbb{R})$, and $\mathcal{K} \subset \mathbb{R}$ compact. Then there exists $(k_n)_{n \in \mathbb{N}} \subset \mathbb{N}$ with $\lim_{n \to \infty} k_n = \inf

Figures (9)

  • Figure 1: Coding scheme approximations from Theorem \ref{Theorem-Main1} for $f(x) = 0.6e^{-50(x-0.3)^2} + 0.6e^{-30(x-0.7)^2} + 0.35\cdot\mathbbm{1}_{[0.55,\infty)}(x)$ using G-LReLU activations from $\mathcal{F}_{\pm}$ with an FNN of width $1$. Left: $K=3$, $M=8$, $\gamma=0.05$. Right: $K=4$, $M=14$, $\gamma=0.01$. Here $K, M$ are the coding scheme parameters (Definition \ref{Definition-Coding_scheme}) and $\gamma$ is the width of the exceptional intervals of the quantizer approximation with large error magnitudes (Lemma \ref{Lemma-Approximate_q_k}).
  • Figure 2: Coding scheme approximations from Theorem \ref{Theorem-Main1} for $f(x) = 0.6e^{-50(x-0.3)^2} + 0.6e^{-30(x-0.7)^2} + 0.35\cdot\mathbbm{1}_{[0.55,\infty)}(x)$ using LReLU activations from $\mathcal{F}_{+}$ with an FNN of width $2$. Left: $K=3$, $M=8$, $\gamma=0.05$. Right: $K=4$, $M=14$, $\gamma=0.01$. Here $K, M$ are the coding scheme parameters (Definition \ref{Definition-Coding_scheme}) and $\gamma$ is the width of the exceptional intervals of the quantizer approximation with large error magnitudes (Lemma \ref{Lemma-Approximate_q_k}).
  • Figure 3: Coding scheme approximations from Theorem \ref{Theorem-Main_sup} for $f(x) = -3(x - 0.5)^2 + 0.9$ using SG-LReLU activations from $\mathcal{F}_{\pm,\mathfrak{s}}$ with an FNN of width $1$. Left: $K=3$, $M=4$, $\alpha=0.02$. Right: $K=4$, $M=8$, $\alpha=0.001$. Here $K, M$ are the coding scheme parameters (Definition \ref{Definition-Coding_scheme}) and $\alpha$ is the slope of the quantizer approximation (Lemma \ref{Lemma-Approximate_q_k_stepped}).
  • Figure 4: A visualization of the coding scheme on an explicit example. The considered $f^{*}$ fulfills $f^{*}(\mathfrak{q}_4(x))=(0.011,\ldots,0.110)^T$, which leads to the shown output of the memorizer. For the coding scheme to be well-defined, the parameter of the encoder must coincide with the first parameter of the memorizer, and the second parameter of the memorizer must coincide with the parameter of the decoder. For details about the different parts of the coding scheme, see Definition \ref{Definition-Coding_scheme}.
  • Figure 5: Quantizer approximations from Lemma \ref{Lemma-Approximate_q_k}. Left: $K=2$, $\alpha=0.35$, $\gamma=0.03$. Right: $K=3$, $\alpha=0.1$, $\gamma=0.01$. Here $\alpha$ is the slope on the flat pieces, where the approximation closely matches the true quantizer, and $\gamma$ is the width of the exceptional intervals with larger error. The approximating quantizers are obtained by decreasing $\alpha$ and $\gamma$ to $0$ while increasing $K$ to infinity.
  • ...and 4 more figures
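
The encoder, memorizer, and decoder structure named in the captions above can be illustrated with ordinary arithmetic. The sketch below is not the paper's construction: it assumes inputs and outputs in $[0,1)$, assumes the quantizer truncates to $K$ (input) or $M$ (output) binary digits (which is at least consistent with the binary strings in the Figure 4 caption), and uses exact floor operations where the paper uses networks with leaky ReLU type activations. All function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def quantize(x: float, bits: int) -> int:
    """Index of the dyadic cell of width 2**-bits containing x, clamped to [0, 1)."""
    idx = int(math.floor(x * 2**bits))
    return max(0, min(idx, 2**bits - 1))


def encode(xs: Sequence[float], bits: int) -> float:
    """Pack the quantized coordinates into a single scalar in [0, 1)."""
    packed = 0
    for x in xs:
        packed = packed * 2**bits + quantize(x, bits)
    return packed / 2 ** (bits * len(xs))


def decode(code: float, bits: int, dim: int) -> List[float]:
    """Unpack a scalar code into dim coordinates with `bits` binary digits each."""
    packed = int(round(code * 2 ** (bits * dim)))
    coords = []
    for _ in range(dim):
        packed, idx = divmod(packed, 2**bits)
        coords.append(idx / 2**bits)
    return coords[::-1]  # last coordinate was unpacked first


def coding_scheme(
    f: Callable[[List[float]], List[float]],
    xs: Sequence[float],
    d_y: int,
    K: int,
    M: int,
) -> List[float]:
    """Approximate f(xs) by passing through a one-dimensional code."""
    code_in = encode(xs, K)                # encoder: R^{d_x} -> R
    x_quant = decode(code_in, K, len(xs))  # quantized input represented by the code
    code_out = encode(f(x_quant), M)       # memorizer: R -> R (codeword of f)
    return decode(code_out, M, d_y)        # decoder: R -> R^{d_y}
```

Every intermediate value between the three stages is a single scalar, which is the sense in which the construction passes through an intermediate dimension of one. Under these assumptions, increasing $K$ and $M$ shrinks the quantization error for a continuous target on the unit cube.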

Theorems & Definitions (89)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Definition 7
  • Definition 8
  • Lemma 9
  • Theorem 10
  • ...and 79 more