
Minimum Width of Deep Narrow Networks for Universal Approximation

Xiao-Song Yang, Qi Zhou, Xuan Zhou

TL;DR

This work studies the minimum width $w_{min}$ that deep narrow fully connected networks need for universal approximation, and shows how it depends on the input dimension $d_x$, the output dimension $d_y$, and the activation function. The authors combine geometric and topological arguments, including an approach based on the Poincaré-Miranda theorem, to prove a lower bound for activations that are injective or uniformly approximable by injective functions, and they establish upper bounds for ELU, SELU, and other ReLU variants. Specifically, $w_{min} \leq \max(2d_x+1, d_y)$ for ELU/SELU, with $w_{min}=2d_x+1$ when $d_y=2d_x$; $d_x+1 \leq w_{min} \leq d_x+d_y$ for LeakyReLU/ELU/CELU/SELU/Softplus; and $w_{min} \ge d_y+\mathbf{1}_{d_x<d_y\leq 2d_x}$ for injective activations. Numerical experiments on rotation maps and the DISK dataset accompany the theory and illustrate the width-depth trade-offs in practice. Together, the bounds indicate how wide a network must be to guarantee universal approximation for a given activation function, which informs both the theory and the design of deep narrow architectures.

Abstract

Determining the minimum width of fully connected neural networks has become a fundamental problem in recent theoretical studies of deep neural networks. In this paper, we study lower and upper bounds on the minimum width that fully connected neural networks require in order to have universal approximation capability, which is important in network design and training. We show that $w_{min}\leq\max(2d_x+1, d_y)$ also holds for networks with ELU and SELU activation functions, and that this upper bound is attained when $d_y=2d_x$, where $d_x$ and $d_y$ denote the input and output dimensions, respectively. In addition, we show that $d_x+1\leq w_{min}\leq d_x+d_y$ for networks with LeakyReLU, ELU, CELU, SELU, and Softplus activation functions, by proving that the ReLU activation function can be approximated by these activation functions. Finally, in the case that the activation function is injective or can be uniformly approximated by a sequence of injective functions (e.g., ReLU), we present a new proof of the inequality $w_{min}\ge d_y+\mathbf{1}_{d_x<d_y\leq2d_x}$ by constructing a more intuitive example via a new geometric approach based on the Poincaré-Miranda Theorem.
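
The three bounds stated in the abstract can be evaluated directly for concrete dimensions. The following is a minimal sketch that only transcribes those inequalities; the function names are hypothetical and do not come from the paper.

```python
def elu_selu_upper_bound(d_x: int, d_y: int) -> int:
    """Upper bound max(2*d_x + 1, d_y) stated for ELU and SELU networks."""
    return max(2 * d_x + 1, d_y)


def relu_variant_bounds(d_x: int, d_y: int) -> tuple[int, int]:
    """(lower, upper) = (d_x + 1, d_x + d_y), stated for
    LeakyReLU, ELU, CELU, SELU, and Softplus networks."""
    return d_x + 1, d_x + d_y


def injective_lower_bound(d_x: int, d_y: int) -> int:
    """Lower bound d_y + 1_{d_x < d_y <= 2*d_x} stated for activations that are
    injective or uniformly approximable by injective functions."""
    indicator = 1 if d_x < d_y <= 2 * d_x else 0
    return d_y + indicator


if __name__ == "__main__":
    # Regime d_y = 2*d_x, where the abstract says the upper bound is attained.
    d_x, d_y = 2, 4
    print("ELU/SELU upper bound:", elu_selu_upper_bound(d_x, d_y))
    print("Injective lower bound:", injective_lower_bound(d_x, d_y))
    print("ReLU-variant bounds:", relu_variant_bounds(d_x, d_y))
```

When $d_y=2d_x$, the indicator equals 1, so the lower bound $d_y+1$ and the upper bound $\max(2d_x+1, d_y)$ coincide at $2d_x+1$, which is the sense in which the upper bound is attained.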

Paper Structure

This paper contains 27 sections, 16 theorems, 67 equations, 6 figures, 2 tables.

Key Result

Proposition 5

$N_{m, n, k}^{\sigma} = N(\sigma; m, k, \cdots,k , n)$.
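
One way to read this identity is through zero-padding. The sketch below assumes, without confirmation from this summary, that $N_{m,n,k}^{\sigma}$ is the set of networks with input dimension $m$, output dimension $n$, and hidden widths at most $k$, and that $N(\sigma; m, k, \cdots, k, n)$ is the set whose hidden layers all have width exactly $k$. Under that assumption, a narrower network can be widened to width $k$ without changing the function it computes. All helper names are hypothetical, and this is not necessarily the argument used in the paper.

```python
import math
import random


def affine(W, b, x):
    """Compute W @ x + b with plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def softplus(v):
    """Elementwise Softplus; note that softplus(0) != 0."""
    return [math.log1p(math.exp(t)) for t in v]


def forward(layers, x, act=softplus):
    """Apply the activation after every layer except the last (output) layer."""
    *hidden, (W_out, b_out) = layers
    for W, b in hidden:
        x = act(affine(W, b, x))
    return affine(W_out, b_out, x)


def pad_to_width(layers, k):
    """Widen every hidden layer to exactly k units (requires hidden widths <= k).

    New units get zero incoming weights and biases, and the next layer gets zero
    weights on them, so their activations never reach the output."""
    padded = []
    in_dim = len(layers[0][0][0])  # input dimension m stays unchanged
    for i, (W, b) in enumerate(layers):
        is_last = i == len(layers) - 1
        out_dim = len(W) if is_last else k
        # Zero columns: ignore the padded units of the previous layer.
        rows = [row + [0.0] * (in_dim - len(row)) for row in W]
        # Zero rows: the padded units of this layer.
        rows += [[0.0] * in_dim for _ in range(out_dim - len(W))]
        padded.append((rows, b + [0.0] * (out_dim - len(b))))
        in_dim = out_dim
    return padded


def random_network(dims, rng):
    """Random weights for layer sizes dims = [m, h_1, ..., h_L, n]."""
    return [
        ([[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)],
         [rng.uniform(-1, 1) for _ in range(d_out)])
        for d_in, d_out in zip(dims[:-1], dims[1:])
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    net = random_network([2, 3, 2, 2], rng)  # hidden widths 3 and 2
    wide = pad_to_width(net, k=4)            # hidden widths 4 and 4
    x = [0.3, -0.7]
    print(forward(net, x))
    print(forward(wide, x))
```

The padded units output $\sigma(0)$, which is nonzero for Softplus, but the zero outgoing weights cancel that contribution, so the construction does not rely on $\sigma(0)=0$.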

Figures (6)

  • Figure 1: Self-intersection structure of $\Phi_\sigma$. The orange polyline represents $g([0, 1])$, and the green curves $L_1$ and $L_2$ denote $\Phi_\sigma([0, \frac{1}{5}])$ and $\Phi_\sigma([\frac{4}{5}, 1])$, respectively. Then $L_1$ must intersect $L_2$ at the red point $P$, which contradicts the property that the network $\Phi_\sigma$ is a topological embedding.
  • Figure 2: Diagram of the geometric transformations of LeakyReLU layers. Subfigures (a), (b), and (c) use the constructions of $\Phi_2, \Phi_4, \Phi_6$, respectively, which are given in the paper's appendix. The parameters are $\alpha=0.1$, $\beta=0.2$, $\beta_1=\beta^2=0.04$, $\beta_2=\beta^1=0.2$. They achieve the desired accuracy over $K_1=[-3, +\infty)$, $K_2=[-6, +\infty)$, and $K_3=[-9, +\infty)$, respectively.
  • Figure 3: Diagram of the geometric transformations of ELU layers. Subfigures (a), (b), and (c) use the constructions of $\Phi_1, \Phi_2, \Phi_3$, respectively, which are given in the paper's appendix. They achieve the desired accuracy over $K_1=[-3, +\infty)$, $K_2=[-6, +\infty)$, and $K_3=[-9, +\infty)$, respectively.
  • Figure 4: Comparison between the target and output sets when using an ELU network with width 4 and depth 4 to approximate the 2-rotation map $rot_2$.
  • Figure 5: Diagram of the loss functions for the experiment on the DISK dataset.
  • ...and 1 more figure
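
The idea behind Figures 2 and 3, that ReLU can be approximated by ReLU-like activations, can be checked numerically with simpler constructions than the paper's $\Phi_i$. The sketch below is not the authors' construction: it uses composition of LeakyReLU layers (negative slope $\alpha^n$ after $n$ layers) and input/output rescaling $\sigma(\beta x)/\beta$ for ELU and Softplus, whose error bounds $\alpha^n M$ on $[-M, +\infty)$, $1/\beta$, and $\ln 2/\beta$ follow from elementary calculus. All function names are hypothetical.

```python
import math


def relu(x):
    return max(x, 0.0)


def leaky_relu(x, alpha=0.1):
    return x if x >= 0 else alpha * x


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * math.expm1(x)


def softplus(x):
    # Numerically stable form of log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def composed_leaky(x, n, alpha=0.1):
    """n stacked LeakyReLU layers: identity for x >= 0, slope alpha**n for x < 0."""
    for _ in range(n):
        x = leaky_relu(x, alpha)
    return x


def scaled(act, x, beta):
    """act(beta * x) / beta; both scalings can be absorbed into affine layers."""
    return act(beta * x) / beta


def sup_error(f, lo, hi, steps=2001):
    """Largest |f(x) - relu(x)| on a uniform grid over [lo, hi]."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(abs(f(x) - relu(x)) for x in grid)


if __name__ == "__main__":
    M, n, alpha, beta = 3.0, 3, 0.1, 100.0
    print("LeakyReLU x%d:" % n, sup_error(lambda x: composed_leaky(x, n, alpha), -M, M),
          "bound", alpha ** n * M)
    print("scaled ELU:", sup_error(lambda x: scaled(elu, x, beta), -M, M),
          "bound", 1.0 / beta)
    print("scaled Softplus:", sup_error(lambda x: scaled(softplus, x, beta), -M, M),
          "bound", math.log(2.0) / beta)
```

The LeakyReLU error grows linearly with $|x|$ on the negative side, which is consistent with the figure captions stating accuracy only over half-lines such as $K_1=[-3, +\infty)$, whereas the rescaled ELU and Softplus errors are uniform over the whole real line.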

Theorems & Definitions (22)

  • Definition 1: Architecture of a Neural Network
  • Definition 2: Depth and Width of a NN
  • Definition 3: Activation Function
  • Definition 4: Set of NNs
  • Proposition 5: Equivalence between Sets of NNs
  • Definition 6: Minimum Width
  • Theorem 7: Density of Full-Rank Matrices
  • Theorem 8: UAP of Networks with Full-Rank Weight Matrices
  • Definition 9: Topological Embedding
  • Theorem 10: Poincaré-Miranda Theorem
  • ...and 12 more
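
For reference, the Poincaré-Miranda theorem listed above is commonly stated on the cube $[-1,1]^n$ as follows. This is the standard textbook form; the normalization and sign convention in the paper's Theorem 10 may differ.

$$
\begin{aligned}
&\text{Let } f=(f_1,\dots,f_n)\colon [-1,1]^n \to \mathbb{R}^n \text{ be continuous, and suppose that for each } i=1,\dots,n,\\
&\qquad f_i(x)\le 0 \ \text{ whenever } x_i=-1, \qquad f_i(x)\ge 0 \ \text{ whenever } x_i=1.\\
&\text{Then there exists } x^*\in[-1,1]^n \text{ such that } f(x^*)=0.
\end{aligned}
$$

For $n=1$ this reduces to the intermediate value theorem, which is why it is a natural tool for forcing two curves to cross, as in the self-intersection argument illustrated in Figure 1.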