The Power of Depth for Feedforward Neural Networks

Ronen Eldan, Ohad Shamir

TL;DR

The paper proves a depth separation result for feedforward neural networks of bounded size: a simple radial function in $\mathbb{R}^d$ expressible by a polynomial-width 3-layer network cannot be approximated by any 2-layer network unless the width grows exponentially with the dimension. The authors construct a hard radial target via a random shell-based function and analyze it using Fourier methods, showing that 2-layer nets have frequency-support constraints that prevent small-width approximation, while a 3-layer construction can exploit the radial structure by first computing $\|\mathbf{x}\|^2$ and then applying a univariate map. The proof hinges on a density $\mu$ derived from the ball's Fourier transform and a detailed sequence of lemmas about high-frequency mass and Lipschitz approximability. Overall, the work formalizes that depth confers exponential advantages in expressivity for standard networks, under broad activation assumptions, and provides explicit width bounds for the 3-layer construction.
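The 3-layer idea described above (first compute $\|\mathbf{x}\|^2$, then apply a univariate map) can be illustrated with a minimal NumPy sketch. This is not the paper's exact construction: it assumes ReLU activations, inputs confined to a box $[-\text{bound}, \text{bound}]^d$, and a smooth stand-in radial profile `g`. The names `relu_interpolant` and `three_layer_radial` and the knot counts are hypothetical choices made for illustration.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def relu_interpolant(f, lo, hi, n_knots):
    """Piecewise-linear interpolant of f on [lo, hi] written with ReLUs.

    Returns (knots, coeffs, bias) such that
        bias + sum_k coeffs[k] * relu(t - knots[k])
    agrees with f at n_knots equally spaced points of [lo, hi].
    """
    knots = np.linspace(lo, hi, n_knots)
    vals = f(knots)
    slopes = np.diff(vals) / np.diff(knots)
    # Each ReLU unit contributes the change in slope at its knot.
    coeffs = np.diff(slopes, prepend=0.0)
    return knots[:-1], coeffs, vals[0]


def three_layer_radial(X, g, bound, n_sq=64, n_g=64):
    """Approximate x -> g(||x||) with two hidden ReLU layers (illustrative only)."""
    d = X.shape[1]

    # Hidden layer 1: d * (n_sq - 1) units; each coordinate gets its own
    # ReLU approximation of t -> t^2 on [-bound, bound].
    k1, c1, b1 = relu_interpolant(np.square, -bound, bound, n_sq)
    h1 = relu(X[:, :, None] - k1[None, None, :])
    # Affine read-out of layer 1: an approximation of ||x||^2.
    sq_norm = d * b1 + (h1 * c1).sum(axis=(1, 2))

    # Hidden layer 2: (n_g - 1) units approximating s -> g(sqrt(s))
    # on [0, d * bound^2], the range of ||x||^2 over the box.
    k2, c2, b2 = relu_interpolant(lambda s: g(np.sqrt(s)), 0.0, d * bound**2, n_g)
    h2 = relu(sq_norm[:, None] - k2[None, :])

    # Linear output layer.
    return b2 + h2 @ c2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(1000, 10))
    approx = three_layer_radial(X, np.cos, bound=1.0)
    exact = np.cos(np.linalg.norm(X, axis=1))
    print("max abs error:", np.max(np.abs(approx - exact)))
```

In this sketch the total width grows only linearly with the dimension (about `d * n_sq + n_g` units), which is the qualitative point of the 3-layer upper bound. The paper's lower bound for 2-layer networks is not something a numerical sketch like this can demonstrate.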

Abstract

We show that there is a simple (approximately radial) function on $\mathbb{R}^d$, expressible by a small 3-layer feedforward neural network, which cannot be approximated by any 2-layer network, to more than a certain constant accuracy, unless its width is exponential in the dimension. The result holds for virtually all known activation functions, including rectified linear units, sigmoids and thresholds, and formally demonstrates that depth -- even if increased by 1 -- can be exponentially more valuable than width for standard feedforward neural networks. Moreover, compared to related results in the context of Boolean functions, our result requires fewer assumptions, and the proof techniques and construction are very different.

Paper Structure

This paper contains 18 sections, 18 theorems, 137 equations, 2 figures.

Key Result

Theorem 1

Suppose the activation function $\sigma(\cdot)$ satisfies Assumption 1 with constant $c_{\sigma}$, as well as Assumption 2. Then there exist universal constants $c,C>0$ such that the following holds: For every dimension $d>C$, there is a probability measure $\mu$ on $\mathbb{R}^d$

Figures (2)

  • Figure 1: The left figure represents $\varphi(\mathbf{x})$ in $d=2$ dimensions. The right figure represents a cropped and re-scaled version, to better show the oscillations of $\varphi$ beyond the big origin-centered bump. The density of the probability measure $\mu$ is defined as $\varphi^2(\cdot)$
  • Figure 2: Bessel function of the first kind, $J_{20}(\cdot)$
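
For orientation, the captions above can be tied together by the standard closed form for the Fourier transform of a Euclidean ball's indicator. The block below is a hedged reconstruction from that general fact and from the summary's statement that $\mu$ is derived from the ball's Fourier transform; it is not quoted from the paper. The symbol $R_d$ (radius of the unit-volume ball) and the convention $\hat{f}(\mathbf{w}) = \int f(\mathbf{x})\, e^{-2\pi i \langle \mathbf{x}, \mathbf{w}\rangle}\, d\mathbf{x}$ are assumptions made here.

$$
\varphi(\mathbf{x}) \;=\; \widehat{\mathbf{1}_{\{\|\cdot\| \le R_d\}}}(\mathbf{x}) \;=\; \left(\frac{R_d}{\|\mathbf{x}\|}\right)^{d/2} J_{d/2}\!\left(2\pi R_d \|\mathbf{x}\|\right),
\qquad
R_d \;=\; \frac{1}{\sqrt{\pi}}\,\Gamma\!\left(\frac{d}{2}+1\right)^{1/d}.
$$

Under this reading, $\varphi^2$ integrates to 1 by Plancherel's theorem, since the indicator of a unit-volume ball has unit $L^2$ norm. That is consistent with $\varphi^2(\cdot)$ being a probability density, and it explains why a Bessel function of the first kind appears in Figure 2.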

Theorems & Definitions (34)

  • Theorem 1
  • Remark 1: Activation function
  • Remark 2: Constraints on the parameters
  • Remark 3: Properties of $g$
  • Lemma 1
  • Lemma 2
  • Definition 1
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • ...and 24 more