The Computational Advantage of Depth: Learning High-Dimensional Hierarchical Functions with Gradient Descent
Yatin Dandi, Luca Pesce, Lenka Zdeborová, Florent Krzakala
TL;DR
The paper investigates the computational benefits of depth when learning high-dimensional hierarchical functions by gradient descent, introducing single-index (SIGHT) and multi-index (MIGHT) Gaussian hierarchical targets to model compositional hierarchies. It shows that depth enables a progressive coarse-graining of representations: the effective input dimensionality drops from $d$ to $d^{\varepsilon_1}$ and then further at each subsequent layer, which yields sharp sample-complexity thresholds. The main theoretical result shows that a three-layer network trained layerwise first recovers the first-layer weights $W^{\star}$ with $n_1=\tilde{O}(d^{\varepsilon_1+1})$ samples, then recovers the non-linear feature $h^{\star}$ with $n_2=\tilde{O}(d^{k\varepsilon_1})$ samples, and finally fits the target with $\tilde{O}(1)$ samples; the result extends to deeper MIGHT targets. Numerical experiments corroborate the theory, showing depth-enabled feature learning and improved generalization under realistic training procedures. The authors also propose a conjecture on hierarchical learning, based on a Compositional Information Exponent, that would generalize these results to broader settings.
Abstract
Understanding the advantages of deep neural networks trained by gradient descent (GD) compared to shallow models remains an open theoretical challenge. In this paper, we introduce a class of target functions (single- and multi-index Gaussian hierarchical targets) that incorporate a hierarchy of latent subspace dimensionalities. This framework enables us to analytically study the learning dynamics and generalization performance of deep networks compared to shallow ones in the high-dimensional limit. Specifically, our main theorem shows that feature learning with GD successively reduces the effective dimensionality, transforming a high-dimensional problem into a sequence of lower-dimensional ones. This enables learning the target function with drastically fewer samples than with shallow networks. While the results are proven in a controlled training setting, we also discuss more common training procedures and argue that they learn through the same mechanisms.
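
To make the hierarchy of dimensionalities concrete, the following is a minimal NumPy sketch of a single-index hierarchical target that compresses a $d$-dimensional Gaussian input to $d^{\varepsilon_1}$ linear features and then to one scalar non-linear feature. The exact form is an assumption made for illustration and is not taken from the paper's definition: the helper name `make_hierarchical_target`, the orthonormal choice of $W^{\star}$, the quadratic Hermite non-linearity (corresponding to $k=2$), the unit-norm vector `a`, and the `tanh` readout are all hypothetical.

```python
import numpy as np


def make_hierarchical_target(d, eps1=0.5, seed=0):
    """Illustrative hierarchical target: d -> d**eps1 -> 1 -> output.

    All modelling choices here are assumptions made for illustration only.
    """
    rng = np.random.default_rng(seed)
    r = max(1, int(round(d ** eps1)))  # latent dimension d^{eps1}

    # First level: r orthonormal directions in R^d (rows of W).
    q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    W = q.T  # shape (r, d), satisfies W @ W.T = I_r

    # Second level: unit-norm combination of the r non-linear features.
    a = rng.standard_normal(r)
    a /= np.linalg.norm(a)

    def h(x):
        # x has shape (n, d); z = W x has i.i.d. N(0, 1) entries for Gaussian x.
        z = x @ W.T
        # Second Hermite polynomial He_2(z) = z^2 - 1 has variance 2,
        # so dividing by sqrt(2) gives h unit variance.
        return ((z ** 2 - 1.0) @ a) / np.sqrt(2.0)

    def f(x):
        # Final scalar readout applied to the single feature h(x).
        return np.tanh(h(x))

    return W, h, f


if __name__ == "__main__":
    d, n = 64, 1000
    W, h, f = make_hierarchical_target(d, eps1=0.5)
    x = np.random.default_rng(1).standard_normal((n, d))
    print("latent dimension:", W.shape[0])
    print("feature shape:", h(x).shape, "target shape:", f(x).shape)
```

In this toy construction a learner that has recovered `W` only needs to work in the $d^{\varepsilon_1}$-dimensional space of `x @ W.T`, and one that has also recovered `h` faces a one-dimensional regression problem, which mirrors the successive reduction described above.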
