Depth Separations in Neural Networks: Separating the Dimension from the Accuracy

Itay Safran; Daniel Reichman; Paul Valiant

TL;DR

This work proves an exponential size separation between depth 2 and depth 3 neural networks (with real inputs) when approximating an $\mathcal{O}(1)$-Lipschitz target function to constant accuracy with respect to a distribution supported in the unit ball, under the mild assumption that the weights of the depth 2 network are exponentially bounded.

Abstract

We prove an exponential size separation between depth 2 and depth 3 neural networks (with real inputs) when approximating an $\mathcal{O}(1)$-Lipschitz target function to constant accuracy with respect to a distribution supported in the unit ball, under the mild assumption that the weights of the depth 2 network are exponentially bounded. This resolves an open problem posed in \citet{safran2019depth}, and proves that the curse of dimensionality manifests itself in depth 2 approximation, even in cases where the target function can be represented efficiently using a depth 3 network. Previous lower bounds separating depth 2 from depth 3 networks required that at least one of the Lipschitz constant, the target accuracy, or (some measure of) the size of the domain of approximation scale \emph{polynomially} with the input dimension. In our result these parameters are fixed to be \emph{constants} independent of the input dimension, so our parameters are simultaneously optimal. Our lower bound holds for a wide variety of activation functions, and is based on a novel application of a worst- to average-case random self-reducibility argument, allowing us to leverage depth 2 threshold circuit lower bounds in a new domain.

Paper Structure (20 sections, 10 theorems, 108 equations, 1 figure, 1 table)

Introduction
Separating depth 2 from depth 3, continuous, $L_2$ lower bound setting.
Connection between neural networks and threshold circuits.
Progress on the open question posed in \citet{safran2019depth}.
Setting and Main Result
Preliminaries and notation
Notation and terminology.
Neural networks and threshold circuits.
Formal construction
Assumptions
Main result
Lower Bound
Techniques, and proof sketch of Thm. \ref{thm:lb}
Step 1: From a continuous distribution on the unit ball to a discrete distribution on the Boolean hypercube
Step 2: From worst- to average-case using randomization
...and 5 more sections

Key Result

Theorem 1.1

There exists a sequence of distributions $\left\{\mathcal{D}_{4d}\right\}_{d=1}^{\infty}$ supported in the $4d$-dimensional Euclidean ball, and a sequence of $\mathcal{O}(1)$-Lipschitz functions $f_d:\mathbb{R}^{4d}\to\mathbb{R}$, such that, using any $\mathcal{O}(1)$-Lipschitz or threshold activati

Figures (1)

Figure 1: Computing $f_d$ using a depth 3 ReLU network. Subfigure \ref{fig:a} plots the function $(x,y)\mapsto\left[ 4x+4y-5 \right]_+ - \left[ 4x+4y-6 \right]_+$, which equals $\mathop{\mathrm{AND}}\nolimits(\mathop{\mathrm{round}}\nolimits(x),\mathop{\mathrm{round}}\nolimits(y))$ for all $x,y\in[0,0.25]\cup[0.75,1]$. Subfigure \ref{fig:b} plots the function $x\mapsto t_d(x)$, defined in Example \ref{ex:relu}. Composing the latter with a scaled sum of the former, taken over all pairs of coordinates $x_i,y_i$, yields a function that coincides with $f_d$ on $\mathcal{A}_{4d}$. Best viewed in color.
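The AND gadget in the caption can be checked numerically. The following is a minimal sketch, not code from the paper: the helper names (`relu`, `and_gadget`, `rounded_and`, `max_gadget_error`) and the grid resolution are illustrative assumptions, and the outer function $t_d$ is not reproduced because its definition is not given here.

```python
def relu(z: float) -> float:
    """The ReLU activation [z]_+ = max(z, 0)."""
    return max(z, 0.0)


def and_gadget(x: float, y: float) -> float:
    """The two-ReLU function from the caption: [4x+4y-5]_+ - [4x+4y-6]_+."""
    s = 4 * x + 4 * y
    return relu(s - 5) - relu(s - 6)


def rounded_and(x: float, y: float) -> int:
    """AND(round(x), round(y)); rounding is implemented as a threshold at 0.5,
    which is unambiguous on the domain [0, 0.25] U [0.75, 1]."""
    return int(x >= 0.5) * int(y >= 0.5)


def max_gadget_error(steps: int = 100) -> float:
    """Largest deviation between the gadget and AND over a grid (hypothetical
    resolution) restricted to [0, 0.25] U [0.75, 1] in each coordinate."""
    pts = [i / steps for i in range(steps + 1)]
    pts = [p for p in pts if p <= 0.25 or p >= 0.75]
    return max(abs(and_gadget(x, y) - rounded_and(x, y)) for x in pts for y in pts)


if __name__ == "__main__":
    print(max_gadget_error())
```

On this domain, $4x+4y\ge 6$ when both inputs are at least $0.75$, so the difference of the two ReLUs is exactly $1$; otherwise $4x+4y\le 5$ and both ReLUs vanish. The grid check should therefore report an error at the level of floating-point rounding.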

Theorems & Definitions (19)

Theorem 1.1: Informal version of Thm. \ref{thm:main}
Theorem 2.1
Theorem 3.1
Theorem 4.1
Example 4.2
Lemma A.1
proof
Lemma A.2
proof
Proposition A.3
...and 9 more
