Deep vs. shallow networks : An approximation theory perspective
Hrushikesh Mhaskar, Tomaso Poggio
TL;DR
This paper provides a theoretical framework to explain why deep networks can outperform shallow ones in function approximation, modeling the target as a compositional function whose structure is given by a directed acyclic graph (DAG). It presents direct approximation rates for ReLU and Gaussian networks and extends these bounds to deep, DAG-structured architectures, with rates of the form $n^{-\gamma/d}$, where $d$ is the maximum indegree of the graph. This is an advantage when $d \ll q$, with $q$ the input dimension. A central contribution is the concept of relative dimension, a sparsity measure that helps explain when deep representations are more efficient than shallow ones. The framework unifies several activation families, provides both direct and converse theorems, and motivates the notion of blessed representations that enable deep nets to bypass the curse of dimensionality, with implications for architecture design and learning efficiency.
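To make the role of $d \ll q$ concrete, the following is a minimal sketch that inverts a rate of the form $n^{-\gamma/\mathrm{dim}}$ to estimate how many units are needed for a target accuracy. It rests on assumptions not stated in the summary: the shallow counterpart of the rate is taken to be $n^{-\gamma/q}$, all constants and the number of nodes in the DAG are ignored, and the function name `units_needed` and the numerical values are purely illustrative.

```python
def units_needed(eps: float, gamma: float, dim: int) -> float:
    """Solve n ** (-gamma / dim) = eps for n, ignoring all constants.

    `dim` is the effective dimension in the rate exponent: the input
    dimension q for a shallow network (an assumption here), or the
    maximum indegree d for a deep, DAG-structured network.
    """
    return eps ** (-dim / gamma)


# Illustrative values only, not taken from the paper.
eps = 0.1      # target approximation error
gamma = 2.0    # exponent parameter in the rate
q = 8          # input dimension
d = 2          # maximum indegree of the DAG

shallow = units_needed(eps, gamma, q)  # scales like eps ** (-q / gamma)
deep = units_needed(eps, gamma, d)     # scales like eps ** (-d / gamma)

print(f"shallow estimate: {shallow:.3g}")
print(f"deep estimate:    {deep:.3g}")
```

With these illustrative values the shallow estimate is on the order of $10^4$ units and the deep estimate on the order of $10$; the gap widens as $q$ grows while $d$ stays fixed.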
Abstract
The paper briefly reviews several recent results on hierarchical architectures for learning from examples that may formally explain the conditions under which Deep Convolutional Neural Networks perform much better in function approximation problems than shallow, one-hidden-layer architectures. The paper announces new results for a non-smooth activation function, the ReLU function, used in present-day neural networks, as well as for Gaussian networks. We propose a new definition of relative dimension to encapsulate different notions of sparsity of a function class that can possibly be exploited by deep networks but not by shallow ones to drastically reduce the complexity required for approximation and learning.
