Depth Separations in Neural Networks: Separating the Dimension from the Accuracy

Itay Safran, Daniel Reichman, Paul Valiant

TL;DR

This work proves an exponential size separation between depth 2 and depth 3 neural networks (with real inputs) when approximating an $\mathcal{O}(1)$-Lipschitz target function to constant accuracy, with respect to a distribution supported in the unit ball, under the mild assumption that the weights of the depth 2 network are exponentially bounded.

Abstract

We prove an exponential size separation between depth 2 and depth 3 neural networks (with real inputs), when approximating an $\mathcal{O}(1)$-Lipschitz target function to constant accuracy, with respect to a distribution with support in the unit ball, under the mild assumption that the weights of the depth 2 network are exponentially bounded. This resolves an open problem posed in \citet{safran2019depth}, and proves that the curse of dimensionality manifests itself in depth 2 approximation, even in cases where the target function can be represented efficiently using a depth 3 network. Previously, lower bounds used to separate depth 2 from depth 3 networks required that at least one of the Lipschitz constant, the target accuracy, or (some measure of) the size of the domain of approximation scale polynomially with the input dimension, whereas in our result these parameters are fixed to be constants independent of the input dimension: our parameters are simultaneously optimal. Our lower bound holds for a wide variety of activation functions, and is based on a novel application of a worst-case to average-case random self-reducibility argument, allowing us to leverage lower bounds for depth 2 threshold circuits in a new domain.

Paper Structure (20 sections, 10 theorems, 108 equations, 1 figure, 1 table)

Key Result

Theorem 1.1

There exists a sequence of distributions $\left\{\mathcal{D}_{4d}\right\}_{d=1}^{\infty}$ supported in the $4d$-dimensional Euclidean ball, and a sequence of $\mathcal{O}(1)$-Lipschitz functions $f_d:\mathbb{R}^{4d}\to\mathbb{R}$, such that, using any $\mathcal{O}(1)$-Lipschitz or threshold activation ...

Figures (1)

  • Figure 1: Computing $f_d$ using a depth 3 ReLU network. Subfigure \ref{fig:a} plots the function $(x,y)\mapsto\left[ 4x+4y-5 \right]_+ - \left[ 4x+4y-6 \right]_+$, which equals $\mathop{\mathrm{AND}}\nolimits(\mathop{\mathrm{round}}\nolimits(x),\mathop{\mathrm{round}}\nolimits(y))$ for all $x,y\in[0,0.25]\cup[0.75,1]$. Subfigure \ref{fig:b} plots the function $x\mapsto t_d(x)$, defined in Example \ref{ex:relu}. When composing the latter with a scaled sum of the former, iterating over all pairs of coordinates $x_i,y_i$, we obtain a function that coalesces with $f_d$ on $\mathcal{A}_{4d}$. Best viewed in color.
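
The two-ReLU identity stated in the Figure 1 caption can be checked numerically. The following is a minimal sketch, not code from the paper: the helper names (`relu`, `soft_and`, `round_bit`, `grid`, `check_and_gadget`) and the grid resolution are assumptions made for illustration, and `round_bit` assumes rounding to the nearest of 0 and 1 with a threshold at 0.5, which is unambiguous on the stated domain. The sketch evaluates $\left[ 4x+4y-5 \right]_+ - \left[ 4x+4y-6 \right]_+$ on a grid over $[0,0.25]\cup[0.75,1]$ and compares it with the AND of the rounded inputs.

```python
import itertools


def relu(z: float) -> float:
    """ReLU activation, written [z]_+ in the caption."""
    return max(z, 0.0)


def soft_and(x: float, y: float) -> float:
    """The two-ReLU expression from the Figure 1 caption."""
    s = 4.0 * x + 4.0 * y
    return relu(s - 5.0) - relu(s - 6.0)


def round_bit(x: float) -> int:
    """Round to the nearest bit (assumed threshold at 0.5)."""
    return 1 if x >= 0.5 else 0


def grid(lo: float, hi: float, n: int) -> list:
    """n evenly spaced points in [lo, hi], endpoints included."""
    return [lo + (hi - lo) * k / (n - 1) for k in range(n)]


def check_and_gadget(n: int = 26, tol: float = 1e-9) -> bool:
    """Compare soft_and with AND(round(x), round(y)) on a grid
    over [0, 0.25] U [0.75, 1] in each coordinate."""
    pts = grid(0.0, 0.25, n) + grid(0.75, 1.0, n)
    for x, y in itertools.product(pts, repeat=2):
        expected = float(round_bit(x) and round_bit(y))
        if abs(soft_and(x, y) - expected) > tol:
            return False
    return True


if __name__ == "__main__":
    print(check_and_gadget())
```

The reason the identity holds is visible in the code: when both inputs lie in $[0.75,1]$ the sum $4x+4y$ is at least 6, so both ReLUs are active and their difference is exactly 1; when at least one input lies in $[0,0.25]$ the sum is at most 5, so both ReLUs vanish.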

Theorems & Definitions (19)

  • Theorem 1.1: Informal version of Thm. \ref{thm:main}
  • Theorem 2.1
  • Theorem 3.1
  • Theorem 4.1
  • Example 4.2
  • Lemma A.1
  • proof
  • Lemma A.2
  • proof
  • Proposition A.3
  • ...and 9 more