Deep vs. shallow networks : An approximation theory perspective

Hrushikesh Mhaskar, Tomaso Poggio

TL;DR

This paper provides a theoretical framework to explain why deep networks can outperform shallow ones in function approximation by modeling computation as DAG-based compositional functions. It presents results for several activation types: direct approximation rates for ReLU and Gaussian networks, and extensions of these bounds to deep, DAG-structured architectures with rates of the form $n^{-\gamma/d}$, where $d$ is the maximum indegree of a node in the DAG. This points to potential advantages when $d \ll q$, with $q$ the number of input variables. A central contribution is the relative-dimension concept, a sparsity measure that helps explain when deep representations are more efficient than shallow ones. The framework unifies several activation families, provides both direct and converse theorems, and motivates the notion of blessed representations that enable deep nets to bypass the curse of dimensionality, with implications for architecture design and learning efficiency.
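
To make the role of the exponent concrete, the following minimal sketch inverts a rate of the form $n^{-\gamma/d}$ to estimate how many units are needed for a target accuracy. It is illustrative arithmetic only: all constants hidden in the rate are ignored, the values of `eps`, `gamma`, `q`, and `d` are arbitrary choices, the function name `units_for_accuracy` is hypothetical, and the count of 7 nodes assumes a binary tree over 8 inputs.

```python
import math


def units_for_accuracy(eps: float, gamma: float, dim: int) -> float:
    """Solve n ** (-gamma / dim) = eps for n, ignoring all constants."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    if gamma <= 0 or dim <= 0:
        raise ValueError("gamma and dim must be positive")
    return eps ** (-dim / gamma)


# Illustrative values (assumptions, not taken from the paper).
eps = 0.1      # target accuracy
gamma = 2.0    # exponent parameter in the rate
q = 8          # number of input variables seen by a shallow network
d = 2          # maximum indegree of a node in the DAG
num_nodes = 7  # nodes of a binary tree over 8 inputs: 4 + 2 + 1

# Shallow case: the exponent involves the full input dimension q.
shallow_units = units_for_accuracy(eps, gamma, q)

# Deep case: each node only sees d inputs, so the exponent involves d.
per_node_units = units_for_accuracy(eps, gamma, d)
deep_total = num_nodes * per_node_units

print(f"shallow: ~{shallow_units:.0f} units")
print(f"deep: ~{per_node_units:.0f} units per node, ~{deep_total:.0f} in total")
```

With these illustrative values the shallow estimate is on the order of $10^4$ units, while the deep estimate is on the order of 10 units per node. The gap grows as $q$ increases with $d$ held fixed, which is the sense in which $d \ll q$ matters.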

Abstract

The paper briefly reviews several recent results on hierarchical architectures for learning from examples that may formally explain the conditions under which Deep Convolutional Neural Networks perform much better in function approximation problems than shallow, one-hidden-layer architectures. The paper announces new results for a non-smooth activation function, the ReLU function, used in present-day neural networks, as well as for Gaussian networks. We propose a new definition of relative dimension to encapsulate different notions of sparsity of a function class that can possibly be exploited by deep networks but not by shallow ones to drastically reduce the complexity required for approximation and learning.

Paper Structure

This paper contains 16 sections, 6 theorems, 43 equations, 4 figures.

Key Result

Theorem 2.1

Let $\sigma :{\mathbb R}\to {\mathbb R}$ be infinitely differentiable, and not a polynomial on any subinterval of ${\mathbb R}$. (a) For $f\in W_{r,q}^{\hbox{NN}}$ (b) For $f\in W_{H,r,2}^{\hbox{NN}}$

Figures (4)

  • Figure 1: A scalable function. Each layer consists of identical blocks; each block is a function $H_{2}: {\mathbb R}^2 \mapsto {\mathbb R}$. The overall function shown in the figure is ${\mathbb R}^{32} \mapsto {\mathbb R}$.
  • Figure 2: A shallow universal network in 8 variables and $N$ units which can approximate a generic function $f(x_1, \ldots, x_8)$. The top node consists of $n$ units and computes the ridge function $\sum_{i=1}^n a_i\sigma({\langle {\mathbf{v}_i}, {\mathbf{x}}\rangle}+t_i)$, with $\mathbf{v}_i, \mathbf{x} \in {\mathbb R}^8$, $a_i, t_i\in{\mathbb R}$.
  • Figure 3: A binary tree hierarchical network in 8 variables, which approximates functions of the form (\ref{l-variables}) well. Each of the nodes consists of $n$ units and computes the ridge function $\sum_{i=1}^n a_i\sigma({\langle {\mathbf{v}_i}, {\mathbf{x}}\rangle}+t_i)$, with $\mathbf{v}_i, \mathbf{x} \in {\mathbb R}^2$, $a_i, t_i\in{\mathbb R}$. Like the shallow network, such a hierarchical network can approximate any continuous function; the text proves how it approximates compositional functions better than a shallow network. Shift invariance may additionally hold, implying that the weights in each layer are the same. The inset at the top right shows a network similar to ResNets: our results on binary trees apply to this case as well, with obvious changes in the constants.
  • Figure 4: An example of a $\mathcal{G}$-function ($f^*$ given in (\ref{gfuncexample})). The vertices of the DAG $\mathcal{G}$ are denoted by red dots. The black dots represent the input to the various nodes as indicated by the in-edges of the red nodes, and the blue dot indicates the output value of the $\mathcal{G}$-function, $f^*$ in this example.
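
The node computation described in the captions for Figures 2 and 3 can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: it evaluates the ridge function $\sum_{i=1}^n a_i\sigma(\langle \mathbf{v}_i, \mathbf{x}\rangle + t_i)$ at each node of a binary tree over 8 inputs, with every node taking two inputs. The names `ridge_node`, `random_node`, and `binary_tree_network`, the choice of ReLU for $\sigma$, and the random weights are all assumptions made for the example.

```python
import random
from typing import Callable, List, Sequence, Tuple

# One unit is (a_i, v_i, t_i) with v_i in R^2; a node is a list of n units.
Unit = Tuple[float, Tuple[float, float], float]
Node = List[Unit]


def relu(z: float) -> float:
    return max(0.0, z)


def ridge_node(units: Sequence[Unit], x: Tuple[float, float],
               sigma: Callable[[float], float] = relu) -> float:
    """Compute sum_i a_i * sigma(<v_i, x> + t_i) for a two-input node."""
    return sum(a * sigma(v[0] * x[0] + v[1] * x[1] + t) for a, v, t in units)


def random_node(n: int, rng: random.Random) -> Node:
    """Draw n units with weights uniform in [-1, 1] (illustrative only)."""
    return [(rng.uniform(-1, 1),
             (rng.uniform(-1, 1), rng.uniform(-1, 1)),
             rng.uniform(-1, 1)) for _ in range(n)]


def binary_tree_network(layers: Sequence[Sequence[Node]],
                        x: Sequence[float]) -> float:
    """Evaluate a binary tree of ridge nodes; each layer halves the width."""
    values = list(x)
    for layer in layers:
        if 2 * len(layer) != len(values):
            raise ValueError("each layer needs one node per pair of inputs")
        values = [ridge_node(node, (values[2 * j], values[2 * j + 1]))
                  for j, node in enumerate(layer)]
    if len(values) != 1:
        raise ValueError("the last layer must reduce to a single output")
    return values[0]


rng = random.Random(0)
n = 5  # units per node (arbitrary choice)
# 8 inputs -> 4 nodes -> 2 nodes -> 1 node, i.e. 7 nodes in total.
layers = [[random_node(n, rng) for _ in range(width)] for width in (4, 2, 1)]
x = [0.1 * k for k in range(8)]
y = binary_tree_network(layers, x)
print(f"network output: {y:.4f}")
```

Reusing a single `Node` for every position within a layer would model the shift-invariant case mentioned in the Figure 3 caption, where the weights in each layer are the same.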

Theorems & Definitions (6)

  • Theorem 2.1
  • Theorem 3.1
  • Theorem 3.2
  • Theorem 3.3
  • Theorem 4.1
  • Theorem 4.2