Depth Separation in Norm-Bounded Infinite-Width Neural Networks

Suzanna Parkinson, Greg Ongie, Rebecca Willett, Ohad Shamir, Nathan Srebro

TL;DR

There are functions that are learnable with sample complexity polynomial in the input dimension by norm-controlled depth-3 ReLU networks, yet are not learnable with sub-exponential sample complexity by norm-controlled depth-2 ReLU networks (with any value for the norm).

Abstract

We study depth separation in infinite-width neural networks, where complexity is controlled by the overall squared $\ell_2$-norm of the weights (sum of squares of all weights in the network). Whereas previous depth separation results focused on separation in terms of width, such results do not give insight into whether depth determines if it is possible to learn a network that generalizes well even when the network width is unbounded. Here, we study separation in terms of the sample complexity required for learnability. Specifically, we show that there are functions that are learnable with sample complexity polynomial in the input dimension by norm-controlled depth-3 ReLU networks, yet are not learnable with sub-exponential sample complexity by norm-controlled depth-2 ReLU networks (with any value for the norm). We also show that a similar statement in the reverse direction is not possible: any function learnable with polynomial sample complexity by a norm-controlled depth-2 ReLU network with infinite width is also learnable with polynomial sample complexity by a norm-controlled depth-3 ReLU network.
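The complexity measure named in the abstract, the sum of squares of all weights in the network, can be illustrated with a minimal sketch. The weight values and the names `squared_l2_norm`, `W1`, and `W2` below are hypothetical. Biases are omitted and no normalization is applied, which are simplifying assumptions and may differ from the paper's exact definition.

```python
def squared_l2_norm(layers):
    """Sum of squares of every weight across a list of weight matrices.

    Each matrix is a nested list (rows of floats). Biases are not included;
    that is an assumption of this sketch, not something the abstract states.
    """
    return sum(w * w for matrix in layers for row in matrix for w in row)


# Hypothetical depth-2 ReLU network: 2 inputs -> 3 hidden units -> 1 output.
W1 = [[1.0, -2.0], [0.5, 0.0], [0.0, 3.0]]  # hidden-layer weights (3 x 2)
W2 = [[1.0, -1.0, 2.0]]                     # output-layer weights (1 x 3)

depth2_cost = squared_l2_norm([W1, W2])
print(depth2_cost)  # 14.25 from W1 plus 6.0 from W2 gives 20.25
```

The same helper applies unchanged to a depth-3 network by passing three weight matrices, since the measure depends only on the weights and not on the width of any layer.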

Paper Structure (25 sections, 27 theorems, 111 equations, 2 figures)

Key Result

Theorem 1.1

(Depth Separation, Informal) There is a family of functions $f_d:\mathbb{R}^{2d}\rightarrow\mathbb{R}$ that requires exponential (in $d$) sample complexity to learn to within constant error by regularizing the norm in an unbounded width depth-2 ReLU network, but which can be learned with $poly(d,1/\

Figures (2)

  • Figure 1: Visualization of ${\mathcal{A}}^{\theta}_{L}(S)$, ${\mathcal{A}}^{\theta,\alpha}_{L}(S)$, and ${\mathcal{A}}^{*}_{L}(S)$. The red shaded area represents the set of possible values of $\left(\mathscr{L}_{S}\left(f\right), R_{L}(f)\right)$ where $f$ is represented by an $L$-layer network. The red curves form the Pareto frontier $\mathcal{P}_{L}(S)$. Minimizing the population loss $\mathscr{L}_{\mathscr{D}_d}$ over the Pareto frontier yields ${\mathcal{A}}^{*}_{L}(S)$, represented by the star. In green is the vector $[1,\lambda]^\top$ and lines normal to it. These normal lines form level sets of $\mathscr{L}_{S}\left(f\right) + \lambda R_{L}(f)$. Notice the black dot on the Pareto frontier, which represents ${\mathcal{A}}^{\theta}_{L}(S)$. The output of ${\mathcal{A}}^{\theta}_{L}(S)$ corresponds to $\min_{f\in \mathcal{N}_{L}} \mathscr{L}_{S}\left(f\right) + \lambda R_{L}(f).$ The purple shaded region shows the possible outputs of ${\mathcal{A}}^{\theta,\alpha}_{L}(S)$, which are all $\alpha$-close to ${\mathcal{A}}^{\theta}_{L}(S)$.
  • Figure 2: The sawtooth function $\psi_{n}: {\mathbb{R}} \rightarrow [-1,1]$ with $n=4$. The function $\psi_{n}$ has $n$ cycles in $[-1,1]$ and is equal to zero outside $[-1,1]$.
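The sawtooth described in the Figure 2 caption can be sketched as a triangle wave. The caption fixes only three properties: $n$ cycles in $[-1,1]$, values in $[-1,1]$, and zero outside $[-1,1]$. The phase and the piecewise-linear shape chosen below are assumptions, and the function name `psi` is hypothetical; the paper's exact definition may differ.

```python
import math


def psi(n, x):
    """Triangle-wave sketch of a sawtooth with n cycles on [-1, 1].

    Returns 0 outside [-1, 1]. The phase (starting at 0 and rising first)
    is an assumption of this sketch.
    """
    if x < -1.0 or x > 1.0:
        return 0.0
    t = n * (x + 1.0) / 2.0   # maps [-1, 1] onto [0, n], one unit per cycle
    u = t - math.floor(t)     # position within the current cycle, in [0, 1)
    if u < 0.25:
        return 4.0 * u        # rise from 0 to 1
    if u < 0.75:
        return 2.0 - 4.0 * u  # fall from 1 to -1
    return 4.0 * u - 4.0      # rise from -1 back to 0


# With n = 4, each cycle has length 0.5 on the x-axis.
samples = [psi(4, x) for x in (-1.0, -0.875, -0.75, -0.625, -0.5, 2.0)]
print(samples)  # [0.0, 1.0, 0.0, -1.0, 0.0, 0.0]
```

This shape is continuous at $x=\pm 1$, so the function joins the zero region outside $[-1,1]$ without a jump.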

Theorems & Definitions (57)

  • Theorem 1.1
  • Theorem 1.2
  • Remark 2.1
  • Lemma 3.1: Daniely (2017)
  • Lemma 3.2
  • Corollary 3.3
  • Definition 4.1
  • Definition 4.2
  • Remark 4.3
  • Theorem 5.1: Depth Separation in Learning
  • ...and 47 more