Learning Functions: When Is Deep Better Than Shallow

Hrushikesh Mhaskar, Qianli Liao, Tomaso Poggio

TL;DR

This paper addresses whether depth provides a fundamental advantage for learning functions with compositional structure. By modeling deep nets as binary trees, it proves that deep architectures can match shallow accuracy for compositional targets with exponentially fewer parameters and smaller VC-dimension, supported by n-width arguments and optimality considerations. It extends the analysis to Gaussian networks and presents a general, scalable framework for hierarchical, shift-invariant computations, including a VC-dimension comparison that favors depth for the same approximation goals. The results offer a rigorous explanation for the empirical success of deep, multi-scale architectures such as CNNs and ResNets, and provide guidance on when depth is beneficial versus when it is not.

Abstract

While the universal approximation property holds both for hierarchical and shallow networks, we prove that deep (hierarchical) networks can approximate the class of compositional functions with the same accuracy as shallow networks but with an exponentially lower number of training parameters as well as VC-dimension. This theorem settles an old conjecture by Bengio on the role of depth in networks. We then define a general class of scalable, shift-invariant algorithms to show a simple and natural set of requirements that justify deep convolutional networks.

Paper Structure

This paper contains 10 sections, 5 theorems, 20 equations, 3 figures.

Key Result

Theorem 1

Let $\sigma:\mathbb{R}\to\mathbb{R}$ be infinitely differentiable, and not a polynomial on any subinterval of $\mathbb{R}$. (a) For $f\in W_{r,d}^{\mathrm{NN}}$ (b) For $f\in W_{H,r,d}^{\mathrm{NN}}$

Figures (3)

  • Figure 1: a) A shallow universal network in 8 variables and $N$ units, which can approximate a generic function $f(x_1, \cdots, x_8)$. b) A binary tree hierarchical network in 8 variables, which approximates well functions of the form $f(x_1, \cdots, x_8) = h_3(h_{21}(h_{11}(x_1, x_2), h_{12}(x_3, x_4)), h_{22}(h_{13}(x_5, x_6), h_{14}(x_7, x_8)))$. Each of the nodes in b) consists of $n$ ReLU units and computes the ridge function (Pinkus1999) $\sum_{i=1}^n a_i(\langle \mathbf{v}_i, \mathbf{x} \rangle + t_i)_+$, with $\mathbf{v}_i, \mathbf{x} \in \mathbb{R}^2$ and $a_i, t_i \in \mathbb{R}$. Each term, that is, each unit in the node, corresponds to a "channel". Like the shallow network, a hierarchical network as in b) can approximate any continuous function; the text proves that it approximates compositional functions better than a shallow network does. No invariance is assumed here.
  • Figure 2: A scalable operator. Each layer consists of identical blocks; each block is an operator $H_2: \mathbb{R}^2 \to \mathbb{R}$.
  • Figure 3: A sparse trigonometric function $f(x)=2(2\cos^2(x)-1)^2-1$ with one input variable is learned in a regression set-up using standard deep networks with 1, 2 or 3 hidden layers. In the 1 hidden layer setting, 24, 48, 72, 128 and 256 hidden units were tried. With 2 hidden layers, 12, 24 and 36 units per layer were tried. With 3 hidden layers, 8, 16 and 24 units per layer were tried. Each of the above settings was repeated 5 times, reporting the lowest test error. Mean squared error (MSE) was used as the objective function; the y axes in the figures are the square root of the testing MSE. For the experiments with 2 and 3 hidden layers, batch normalization (ioffe2015batch) was used between every two hidden layers. 60k training and 60k testing samples were drawn from a uniform distribution over $[-2\pi, 2\pi]$. The training process consisted of 2000 passes through the entire training data with mini-batches of size 3000. Stochastic gradient descent with momentum 0.9 and learning rate 0.0001 was used. Implementations were based on MatConvNet (vedaldi2015matconvnet). The same data points are plotted in two sub-figures, with the x axes being the number of units and the number of parameters, respectively. Note that with the input being 1-D, the number of parameters of a shallow network scales slowly with respect to the number of units, giving a shallow network some advantages in the right sub-figure. Although not shown here, the training errors are very similar to the testing errors. The advantage of deep networks is expected to increase with increasing dimensionality of the function. Even in this simple case, the solutions found by SGD are almost certain to be suboptimal. Thus the figure cannot be taken as fully reflecting the theoretical results of this paper.
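
The two constructions described in the captions for Figures 1 and 3 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: all function names (`ridge_node`, `build_tree`, `tree_forward`, `figure3_target`, and the two counters) are hypothetical, the weights are random rather than trained, and the parameter counts simply tally the quantities named in the Figure 1 caption ($\mathbf{v}_i \in \mathbb{R}^2$, $a_i$, $t_i$ per unit). The Figure 3 target is written as a twofold composition of $h(t) = 2t^2 - 1$ applied to $\cos x$, which by the double-angle identity equals $\cos 4x$.

```python
import numpy as np


def ridge_node(x, V, t, a):
    """One node of Figure 1b: sum_i a_i * relu(<v_i, x> + t_i).

    x: (2,) input pair, V: (n, 2) directions, t: (n,) offsets, a: (n,) weights.
    """
    return float(a @ np.maximum(V @ x + t, 0.0))


def init_node(rng, n):
    # Random parameters for illustration only; nothing is trained here.
    return rng.standard_normal((n, 2)), rng.standard_normal(n), rng.standard_normal(n)


def build_tree(rng, d, n):
    """Layers of a binary tree on d inputs (d a power of two), n units per node."""
    layers, width = [], d // 2
    while width >= 1:
        layers.append([init_node(rng, n) for _ in range(width)])
        width //= 2
    return layers


def tree_forward(x, layers):
    """Evaluate the tree: each layer maps adjacent pairs of values to one value."""
    values = np.asarray(x, dtype=float)
    for nodes in layers:
        pairs = values.reshape(-1, 2)
        values = np.array([ridge_node(p, *node) for p, node in zip(pairs, nodes)])
    return float(values[0])


def count_tree_params(d, n):
    # d - 1 nodes, each with n units holding v_i (2 numbers), t_i and a_i.
    return (d - 1) * 4 * n


def count_shallow_params(d, N):
    # N units, each holding a d-dimensional direction, an offset and a weight.
    return N * (d + 2)


def h(t):
    # Constituent function of the Figure 3 target.
    return 2.0 * t**2 - 1.0


def figure3_target(x):
    # f(x) = 2(2cos^2(x) - 1)^2 - 1, written as h(h(cos x)).
    return h(h(np.cos(x)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layers = build_tree(rng, d=8, n=3)
    print([len(layer) for layer in layers])   # nodes per layer: 4, 2, 1
    print(tree_forward(np.arange(8.0), layers))
    xs = np.linspace(-2 * np.pi, 2 * np.pi, 101)
    print(np.allclose(figure3_target(xs), np.cos(4 * xs)))
```

The counters only tally parameters for given unit counts `n` and `N`; how large those must be to reach a given accuracy is what the paper's theorems address, and the sketch makes no claim about that.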

Theorems & Definitions (6)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Definition 1