Minimum width for universal approximation using squashable activation functions

Jonghyun Shin, Namjun Kim, Geonho Hwang, Sejun Park

TL;DR

This work characterizes the minimum width required for universal approximation with general activation functions by introducing squashable activations, which can approximate both the identity and the binary step function when alternately composed with affine transformations. It proves that the minimum width is $w_{\min}=\max\{d_x,d_y,2\}$ for squashable activations whenever $(d_x,d_y)\neq(1,1)$; in the remaining case $d_x=d_y=1$, $w_{\min}\in\{1,2\}$ in general, and $w_{\min}=2$ if the activation is monotone. This extends ReLU-based results to a broad class of activations. The authors establish that all non-affine analytic activations and many piecewise differentiable activations are squashable, and they provide easily verifiable criteria for squashability via a width-1 approximation of the step function and a sigmoidal construction. Their encoder-decoder framework, together with a $\delta$-filling curve construction, enables universal approximation with width $\max\{d_x,d_y,2\}$, offering a general approach to narrow, expressive networks across a wide range of activation functions.

Abstract

The exact minimum width that allows for universal approximation of unbounded-depth networks is known only for ReLU and its variants. In this work, we study the minimum width of networks using general activation functions. Specifically, we focus on squashable functions that can approximate the identity function and the binary step function by alternately composing with affine transformations. We show that for networks using a squashable activation function to universally approximate $L^p$ functions from $[0,1]^{d_x}$ to $\mathbb R^{d_y}$, the minimum width is $\max\{d_x,d_y,2\}$ unless $d_x=d_y=1$; the same bound holds for $d_x=d_y=1$ if the activation function is monotone. We then provide sufficient conditions for squashability and show that all non-affine analytic functions and a class of piecewise functions are squashable, i.e., our minimum width result holds for those general classes of activation functions.
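To make the two approximation requirements concrete, here is a minimal numerical sketch using $\tanh$ as a stand-in activation. The function names, the slope parameters `delta` and `k`, the threshold `c`, and the excluded band around the threshold are illustrative assumptions, not constructions taken from the paper. The sketch uses a single activation between two affine maps, which happens to suffice for $\tanh$; the definition in the abstract allows deeper alternating compositions.

```python
import math

def approx_identity(x, delta=1e-3):
    # h2 ∘ tanh ∘ h1 with h1(x) = delta * x and h2(y) = y / delta.
    # Near 0, tanh(t) ≈ t, so the composition is close to x on a bounded set.
    return math.tanh(delta * x) / delta

def approx_step(x, c=0.5, k=1e3):
    # h2 ∘ tanh ∘ h1 with h1(x) = k * (x - c) and h2(y) = (y + 1) / 2.
    # A steep slope k pushes inputs into the saturated tails of tanh.
    return 0.5 * (math.tanh(k * (x - c)) + 1.0)

# Evaluate both approximations on a grid over [0, 1].
grid = [i / 1000 for i in range(1001)]

# Worst-case deviation from the identity on [0, 1].
id_err = max(abs(approx_identity(x) - x) for x in grid)

# Worst-case deviation from the binary step with threshold 0.5,
# ignoring a small band around the threshold (an assumption of this sketch).
step_err = max(
    abs(approx_step(x) - (1.0 if x > 0.5 else 0.0))
    for x in grid
    if abs(x - 0.5) >= 0.01
)

print(f"identity error: {id_err:.2e}, step error: {step_err:.2e}")
```

Shrinking `delta` tightens the identity approximation, while growing `k` narrows the band around the threshold in which the step approximation is poor.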

Paper Structure

This paper contains 29 sections, 19 theorems, 109 equations, 4 figures, 1 table.

Key Result

Lemma 1

For any $\varepsilon>0$, $\sigma:\mathbb{R}\to\mathbb{R}$ satisfying cond:id, and compact set $\mathcal{K} \subset \mathbb{R}$, there exist affine transformations $h_1, h_2:\mathcal{K}\to\mathbb{R}$ such that

Figures (4)

  • Figure 1: Illustration of the construction of a squashable function using a $\sigma$ network $\rho$ of width $1$ that has a sigmoidal shape when $\phi(x)=x$. The intersections of $\rho(x)$ and $\phi(x)$ serve as fixed points. Thus, $\sigma$ achieves squashability by iteratively composing $\rho$: $\rho^n(x)\to a$ for $x\in(a,c)$ and $\rho^n(x)\to b$ for $x\in(c,b)$ as $n\to\infty$, while $\rho^n$ remains strictly monotone.
  • Figure 2: Illustration of $f_\mathrm{dec}\circ f_\mathrm{enc}$ when $d_x=2$, $d_y=2$, and $N=2$. $f_\mathrm{enc}$ first encodes each $\mathcal{T}_\nu$ to a bounded interval $f_\mathrm{enc}(\mathcal{T}_\nu)$. Then, $f_\mathrm{dec}$ implements a $\delta$-filling curve of $[0,1]^2$, represented by the black curve, to decode each $f_\mathrm{enc}(\mathcal{T}_\nu)$ (colored) so that it approximates $f^*(\mathcal{T}_\nu)$ (represented by the light gray area).
  • Figure 3: (a) Illustration of a $(1/N)$-filling curve $\tilde{f}$ of $[0,1]^3$. $\tilde{f}$ maps each open interval $\mathcal{I}_\nu$, represented by the colored brackets (left), so that its image intersects the corresponding cube of the same color (right). (b) and (c) illustrate our network $\rho$ satisfying the properties of $\phi$ when $N=1$ and $N=3$, respectively.
  • Figure 4: (a) Illustration of a function $g:\mathbb{R}^3 \to\mathbb{R}^2$ that maps sets in a $3$-grid $\mathcal{G}_3$ of size $(2,2,2)$ to distinct sets in a $2$-grid $\mathcal{G}_2$ of size $(2,4)$. (b) Illustration of $\psi_c:\mathbb{R}^2\to\mathbb{R}^2$. Here, the first coordinate of $\psi_c(x)$ is approximately $1$ or $0$ depending on whether $x_1$ exceeds $c$ or not, while the second coordinate is $x_2$. (c) Illustration of our construction of $f$ when $\mathcal{G}$ is a $2$-grid of size $(3,2)$ and $e_2,e_3>0$ are chosen so that all sets in $\mathcal{G}$ are disjoint in the second coordinate.
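
The fixed-point mechanism described in the caption of Figure 1 can be sketched numerically. The particular map `rho` below, a rescaled $\tanh$ with fixed points $a=-1$, $c=0$, and $b=1$, is an illustrative assumption rather than the network used in the paper; it is a width-1 network of the form affine, activation, affine, so iterating it yields a deeper width-1 network.

```python
import math

SCALE = math.tanh(2.0)

def rho(x):
    # Sigmoidal width-1 map with fixed points -1, 0 and 1:
    # rho(±1) = ±1 and rho(0) = 0. The slope at 0 exceeds 1 (repelling),
    # while the slopes at ±1 are below 1 (attracting).
    return math.tanh(2.0 * x) / SCALE

def iterate(x, n):
    # n-fold composition rho^n(x).
    for _ in range(n):
        x = rho(x)
    return x

# Points just left and right of the middle fixed point c = 0
# are driven to a = -1 and b = 1, respectively.
left = iterate(-0.05, 20)
right = iterate(0.05, 20)

# rho^n stays strictly increasing, checked here on a coarse grid for n = 5.
samples = [iterate(-0.9 + 0.1 * i, 5) for i in range(19)]
is_monotone = all(u < v for u, v in zip(samples, samples[1:]))

print(left, right, is_monotone)
```

As the number of iterations grows, the composition approaches a step between $a$ and $b$ with threshold $c$ while remaining strictly monotone, which is the behavior the caption attributes to $\rho^n$.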

Theorems & Definitions (26)

  • Lemma 1: Lemma 4.1 in kidger20
  • Definition 1
  • Theorem 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • Lemma 7: Lemmas 21 and 22 in kim24
  • Definition 2
  • Lemma 8
  • ...and 16 more