Learning Functions: When Is Deep Better Than Shallow
Hrushikesh Mhaskar, Qianli Liao, Tomaso Poggio
TL;DR
This paper asks whether depth gives a fundamental advantage when learning functions with compositional structure. Modeling deep networks as binary trees, it proves that deep architectures can match the accuracy of shallow networks on compositional targets while using exponentially fewer parameters and having a smaller VC-dimension; n-width arguments support the optimality of these bounds. The analysis is extended to Gaussian networks and to a general, scalable framework for hierarchical, shift-invariant computations, including a VC-dimension comparison that favors depth at the same approximation accuracy. The results offer a theoretical explanation for the empirical success of deep, multi-scale architectures such as CNNs and ResNets, and indicate when depth helps and when it does not.
Abstract
While the universal approximation property holds for both hierarchical and shallow networks, we prove that deep (hierarchical) networks can approximate the class of compositional functions with the same accuracy as shallow networks but with an exponentially lower number of trainable parameters as well as VC-dimension. This theorem settles an old conjecture by Bengio on the role of depth in networks. We then define a general class of scalable, shift-invariant algorithms to show a simple and natural set of requirements that justify deep convolutional networks.
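
To make the parameter-count claim concrete, here is a minimal Python sketch; it is an illustration under stated assumptions, not code from the paper. It evaluates a compositional function whose computation graph is a binary tree, then compares rough unit counts for shallow and deep approximators. The names `binary_tree_compose`, `shallow_units`, and `deep_units`, the constituent function `h`, and the symbols `eps` (target accuracy), `n` (input dimension), and `m` (smoothness) are all hypothetical. The counts assume the order-of-magnitude rates eps^(-n/m) for shallow networks and (n - 1) * eps^(-2/m) for binary-tree networks, with all constants omitted.

```python
from typing import Callable, Sequence


def binary_tree_compose(h: Callable[[float, float], float],
                        xs: Sequence[float]) -> float:
    """Evaluate a compositional function with a binary-tree graph.

    Every internal node applies the same two-input function h
    (an illustrative assumption; nodes could differ in general).
    """
    level = list(xs)
    # Require a power-of-two number of inputs so the tree is complete.
    if not level or len(level) & (len(level) - 1):
        raise ValueError("number of inputs must be a power of two")
    while len(level) > 1:
        # Combine neighbouring pairs to form the next level of the tree.
        level = [h(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def shallow_units(eps: float, n: int, m: int) -> float:
    """Assumed shallow rate eps^(-n/m), constants omitted."""
    return eps ** (-n / m)


def deep_units(eps: float, n: int, m: int) -> float:
    """Assumed binary-tree rate (n - 1) * eps^(-2/m), constants omitted."""
    return (n - 1) * eps ** (-2 / m)


if __name__ == "__main__":
    value = binary_tree_compose(lambda a, b: a + b, [1, 2, 3, 4, 5, 6, 7, 8])
    print("tree value:", value)
    print("shallow units:", shallow_units(0.1, 8, 2))
    print("deep units:", deep_units(0.1, 8, 2))
```

With `h` chosen as addition, the eight inputs 1 through 8 evaluate to 36. Under the assumed rates, `eps = 0.1`, `n = 8`, and `m = 2` give roughly 10^4 units for the shallow case against roughly 70 for the deep case; the gap widens exponentially as `n` grows, since only the shallow exponent depends on `n`.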
