Representation Benefits of Deep Feedforward Networks

Matus Telgarsky

TL;DR

This work studies depth versus width in ReLU networks by constructing a family of classification problems indexed by a positive integer $k$. Each problem consists of $n=2^k$ points in $[0,1]$ with alternating labels and yields an exponential separation: every network with $l$ layers and $m \le 2^{(k-3)/l - 1}$ nodes per layer incurs error at least $1/6$, while a deep network with $2$ nodes in each of $2k$ layers (or a recurrent network with $3$ distinct nodes iterated $k$ times) achieves zero error. The argument counts the linear pieces of sawtooth functions and composes a mirror map $f_{\textup{m}}$ with itself to fit the $n$-alternating-point problem ($n$-ap) exactly. The result formalizes an exponential advantage of depth in expressive power on finite data, connects to classical circuit complexity and VC-dimension results, and shows how depth can sharply reduce the number of nodes needed for an exact representation.
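To get a feel for the width bound, the short sketch below evaluates $2^{(k-3)/l - 1}$ for one choice of $k$ and $l$ and compares node counts. The values $k=23$ and $l=2$, the helper name `max_failing_width`, and the comparison by total node count are illustrative assumptions, not taken from the paper.

```python
def max_failing_width(k, l):
    """Evaluate the width bound 2^((k-3)/l - 1) for depth l and index k.

    Hypothetical helper: networks with l layers and at most this many
    nodes per layer are the ones covered by the lower bound.
    """
    return 2 ** ((k - 3) / l - 1)


# Illustrative values only (assumed, not from the paper).
k, l = 23, 2

shallow_width = max_failing_width(k, l)  # nodes per layer covered by the bound
shallow_nodes = l * shallow_width        # total nodes in such a shallow network
deep_nodes = 2 * (2 * k)                 # 2 nodes in each of 2k layers

print(f"n = 2^{k} = {2 ** k} points")
print(f"shallow (l={l}): width up to {shallow_width:.0f}, {shallow_nodes:.0f} nodes in total")
print(f"deep: {deep_nodes} nodes in total")
```

With these values, a two-layer network with as many as 512 nodes per layer still falls under the lower bound, while the deep construction uses 92 nodes.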

Abstract

This note provides a family of classification problems, indexed by a positive integer $k$, where all shallow networks with fewer than exponentially (in $k$) many nodes exhibit error at least $1/6$, whereas a deep network with 2 nodes in each of $2k$ layers achieves zero error, as does a recurrent network with 3 distinct nodes iterated $k$ times. The proof is elementary, and the networks are standard feedforward networks with ReLU (Rectified Linear Unit) nonlinearities.
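The deep network in this statement can be sketched in a few lines of plain Python. The sketch assumes the mirror map has the form $f_{\textup{m}}(x) = \sigma(2\sigma(x) - 4\sigma(x - 1/2))$ with $\sigma$ the ReLU, that the points are $x_i = i/2^k$ with label $i \bmod 2$, and that a real-valued output is turned into a label by thresholding at $1/2$. None of these details are spelled out in this summary, so treat them as assumptions; the function names are hypothetical.

```python
def relu(z):
    """ReLU nonlinearity."""
    return max(z, 0.0)


def mirror_map(x):
    """Assumed mirror map: two ReLU layers with at most 2 nodes each.

    On [0, 1] it equals 2x for x <= 1/2 and 2(1 - x) otherwise.
    """
    return relu(2.0 * relu(x) - 4.0 * relu(x - 0.5))


def deep_net(x, k):
    """Compose the mirror map k times, giving 2k layers in total."""
    for _ in range(k):
        x = mirror_map(x)
    return x


def alternating_points(k):
    """Assumed data set: n = 2^k points x_i = i / n with label i mod 2."""
    n = 2 ** k
    return [(i / n, i % 2) for i in range(n)]


def classification_error(k):
    """Fraction of points misclassified after thresholding at 1/2."""
    points = alternating_points(k)
    mistakes = sum(int(deep_net(x, k) >= 0.5) != y for x, y in points)
    return mistakes / len(points)


for k in range(1, 11):
    print(k, classification_error(k))
```

Each composition doubles the number of teeth of the sawtooth, so $k$ compositions place a zero at every even-indexed point and a peak at every odd-indexed point. The points are dyadic rationals, so the floating-point arithmetic here is exact.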

Paper Structure

This paper contains 7 sections, 6 theorems, 8 equations, 2 figures.

Key Result

theorem 1.1

Let positive integer $k$, number of layers $l$, and number of nodes per layer $m$ be given with $m \leq 2^{(k-3)/l - 1}$. Then there exists a collection of $n:= 2^k$ points $((x_i,y_i))_{i=1}^n$ with $x_i\in [0,1]$ and $y_i\in \{0,1\}$ such that

Figures (2)

  • Figure 1: The $3$-ap.
  • Figure 2: $f_{\textup{m}}$, $f_{\textup{m}}^2$, and $f_{\textup{m}}^3$.

Theorems & Definitions (10)

  • theorem 1.1
  • theorem 1.2
  • lemma 1
  • lemma 2
  • proof
  • lemma 3
  • proof (of the result labeled fact:sawtooth_props)
  • proof (of the result labeled fact:sawtooth_props:2)
  • lemma 4
  • proof (of the result labeled fact:fmk)