The Computational Advantage of Depth: Learning High-Dimensional Hierarchical Functions with Gradient Descent

Yatin Dandi, Luca Pesce, Lenka Zdeborová, Florent Krzakala

TL;DR

The paper investigates the computational benefits of depth in learning high-dimensional hierarchical functions by gradient descent, introducing SIGHT and MIGHT targets to model compositional hierarchies. It shows that depth enables a progressive coarse-graining of representations, reducing the effective input dimensionality from $d$ to $d^{\varepsilon_1}$, then further down across layers, yielding sharp sample-complexity thresholds. The main theoretical result demonstrates that a three-layer network trained in a layerwise fashion can first recover the first-layer weights $W^{\star}$ with $n_1=\tilde{O}(d^{\varepsilon_1+1})$ samples, then recover the non-linear feature $h^{\star}$ with $n_2=\tilde{O}(d^{k\varepsilon_1})$, and finally fit the target with $\tilde{O}(1)$ samples, with extensions to deeper MIGHT targets. Numerical experiments corroborate the theory, showing depth-enabled feature learning and improved generalization under realistic training procedures, and the authors propose a general conjecture on hierarchical learning via a Compositional Information Exponent that would generalize these results to broader settings.
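As a worked instance of these rates, take the setting of Figure 4, where $\varepsilon_1 = 1/2$ and the polynomial is $P_3$. The arithmetic below assumes that $k$ in the second rate is the degree of that polynomial, i.e. $k = 3$; the summary does not define $k$ explicitly, so this is an assumption rather than a statement from the paper.

$$
n_1 = \tilde{O}\!\left(d^{\varepsilon_1 + 1}\right) = \tilde{O}\!\left(d^{3/2}\right),
\qquad
n_2 = \tilde{O}\!\left(d^{k\varepsilon_1}\right) = \tilde{O}\!\left(d^{3 \cdot 1/2}\right) = \tilde{O}\!\left(d^{3/2}\right).
$$

Under this assumption both stages share the exponent $\kappa = \log n / \log d = 3/2$ up to logarithmic factors, which agrees with the $\kappa = 1.5$ transition reported in Figures 4 and 5.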

Abstract

Understanding the advantages of deep neural networks trained by gradient descent (GD) compared to shallow models remains an open theoretical challenge. In this paper, we introduce a class of target functions (single and multi-index Gaussian hierarchical targets) that incorporate a hierarchy of latent subspace dimensionalities. This framework enables us to analytically study the learning dynamics and generalization performance of deep networks compared to shallow ones in the high-dimensional limit. Specifically, our main theorem shows that feature learning with GD successively reduces the effective dimensionality, transforming a high-dimensional problem into a sequence of lower-dimensional ones. This enables learning the target function with drastically fewer samples than with shallow networks. While the results are proven in a controlled training setting, we also discuss more common training procedures and argue that they learn through the same mechanisms.

Paper Structure

This paper contains 69 sections, 38 theorems, 219 equations, 10 figures, 1 algorithm.

Key Result

Theorem 1

Let $f^\star(\mathbf{x})$ be as in Eq. \ref{eq:3layer_target} with $\varepsilon_1 \in (0,1)$, and consider a three-layer model $\hat{f}_\theta$ with $W_1 \in \mathbb{R}^{p_1 \times d}$, $W_2 \in \mathbb{R}^{p_2 \times p_1}$, $\mathbf{w}_3 \in \mathbb{R}^{p_3}$. Let $\mathcal{L}_{cl}(\theta)$ denote the correlation loss, defined as $\mathcal{L}_{cl}(\theta) \coloneqq -\hat{f}_\theta(\mathbf{x}) f^\star(\mathbf{x})$. Under A
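The correlation loss in the statement above is defined per sample. A minimal sketch of how it might be evaluated on a batch follows; the function name `correlation_loss` and the averaging over the batch are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def correlation_loss(pred, target):
    """Batch average of -f_hat(x) * f_star(x).

    Assumption: the per-sample loss from the theorem statement is averaged
    over a batch; the source only gives the per-sample definition.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean(-pred * target))
```

The loss decreases as predictions and targets become more positively correlated, and it has no lower bound in the scale of the predictions, so in practice it is paired with some control on the norm of the model output.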

Figures (10)

  • Figure 1: SIGHT and MIGHT targets: Illustration of Single and Multi Index Gaussian Hierarchical Targets, i.e., SIGHT in eq. \ref{eq:3layer_target-reduced} and MIGHT in eq. \ref{eq:3layer_target_might}. Left: A SIGHT function. Here we first go from ${\bf x} \in {\mathbb R}^d$ to ${\bf z} \in {\mathbb R}^{d^{\varepsilon}}$. After applying the polynomial transformation pointwise (not shown), this is projected to create a scalar $h^\star \in {\mathbb R}$. One can then output the label $y=g^\star(h^\star)$. Right: A MIGHT function. Again, we go from ${\bf x} \in {\mathbb R}^d$ to ${\bf z} \in {\mathbb R}^{d^{\varepsilon}}$. After applying the polynomial transformation pointwise, we finally project onto two values $h_{4,1}^\star$ and $h_{4,2}^\star$, from which we create $y$ as a two-index function $y=g^\star(h_{4,1}^\star,h_{4,2}^\star)$.
  • Figure 2: Deep SIGHT and MIGHT: Illustration of deep target functions. Left: A SIGHT function with depth $L=3$. Here we first go from ${\bf x} \in {\mathbb R}^d$ to ${\bf h}_1 \in {\mathbb R}^{d^{\varepsilon_1}}$. After applying the polynomial transformation pointwise (not shown), we now divide ${\bf h}_1$ into $d^{\varepsilon_2}$ blocks of size $d^{\varepsilon_1-\varepsilon_2}$. Each of these blocks is projected to create one of the components of ${\bf h}_2 \in {\mathbb R}^{d^{\varepsilon_2}}$. After another polynomial transformation (not shown) we finally project to a single value $h_3^\star$. We can then output the label $y=g^\star(h_3^\star)$. Right: A MIGHT function with depth $L=4$. Again, we go from ${\bf x} \in {\mathbb R}^d$ to ${\bf h}_1 \in {\mathbb R}^{d^{\varepsilon_1}}$. After applying the polynomial transformation pointwise (not shown), we now divide ${\bf h}_1$ into $d^{\varepsilon_2}$ blocks of size $d^{\varepsilon_1-\varepsilon_2}$. Each of these blocks is projected to create one of the components of ${\bf h}_2\in {\mathbb R}^{d^{\varepsilon_2}}$. We repeat this operation: we further divide ${\bf h}_2$ into $d^{\varepsilon_3}$ blocks of size $d^{\varepsilon_2-\varepsilon_3}$, and each of these blocks is projected to create one of the components of ${\bf h}_3 \in {\mathbb R}^{d^{\varepsilon_3}}$. After another polynomial transformation (not shown) we finally project onto two values $h_{4,1}^\star$ and $h_{4,2}^\star$ and create $y$ as a two-index function $y=g^\star(h_{4,1}^\star,h_{4,2}^\star)$.
  • Figure 3: An illustration of the phase transitions in learning SIGHT according to the main Theorem \ref{thm:main_theorem}, showing the computational advantage of depth for two different target models: (a) a generic shallow SIGHT function (eq. \ref{eq:3layer_target}) and (b) the example in eq. \ref{main-example}.
  • Figure 4: Numerical simulation: Generalization error versus $\kappa = \log{n}/\log{d}$ for $f^\star(\mathbf{x}) = \tanh\left( 3\, \frac{\mathbf{a}^{\star} \cdot P_3(W^\star \mathbf{x})}{\sqrt{d^{\varepsilon_1}}} \right)$ with $\varepsilon_1 = 1/2$, under different training protocols. (Top) Kernel ridge regression (orange points) only beats the random performance (purple solid line) starting from $n=d + (d-1)d/2$, and is limited to a quadratic approximation (orange line). A $2$-layer net (green points) instead starts to learn at $\kappa=1.5$ (black dashed line) and can beat the quadratic limit (its asymptotic error is given by the green line). A 3-layer net trained layerwise (blue markers) not only learns at $\kappa=1.5$ (vertical line) but also surpasses the best possible 2-layer net error, illustrating the advantage of depth. (Bottom) Comparison of layerwise training (blue) with joint training (red) of all the layers of a 3-layer net with standard backpropagation.
  • Figure 5: Visualizing Feature Learning: The Frobenius norm of the overlaps $M_h, M_W$ (Def. \ref{def:sufficient_stat}), shown in the top and bottom panels respectively, as a function of the sample complexity $\kappa = \frac{\log n}{\log d}$ for three-layer networks trained with the protocol described in Theorem \ref{thm:main_theorem} (blue circles) and standard backpropagation (red squares). Following Theorem \ref{thm:main_theorem}, the behavior sharply changes around $\kappa = 1.5$ (vertical dashed line), where feature learning in both layers arises (same setting as in Fig. \ref{fig:gen_error_fig1}).
  • ...and 5 more figures
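The SIGHT construction described in the captions of Figures 1 and 4 can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' code: it assumes $W^\star$ has orthonormal rows, takes $P_3$ to be the third probabilists' Hermite polynomial $z^3 - 3z$, and normalizes $\mathbf{a}^\star$ to norm $\sqrt{d^{\varepsilon_1}}$. The names `make_sight_target` and `kappa` are hypothetical.

```python
import numpy as np

def make_sight_target(d, eps=0.5, seed=0):
    """Build f*(x) = tanh(3 * a* . P_3(W* x) / sqrt(d^eps)) as in Figure 4."""
    rng = np.random.default_rng(seed)
    m = max(1, int(round(d ** eps)))  # latent dimension d^eps

    # Assumption: W* has orthonormal rows, so z = W* x is standard Gaussian
    # whenever x is standard Gaussian.
    q, _ = np.linalg.qr(rng.standard_normal((d, m)))  # q has shape (d, m)
    w_star = q.T                                      # shape (m, d)

    # Assumption: a* is a random direction rescaled to norm sqrt(m).
    a_star = rng.standard_normal(m)
    a_star *= np.sqrt(m) / np.linalg.norm(a_star)

    def f_star(x):
        z = x @ w_star.T                # project: R^d -> R^{d^eps}
        p3 = z ** 3 - 3.0 * z           # assumed P_3: Hermite polynomial He_3
        h = p3 @ a_star / np.sqrt(m)    # scalar feature h*
        return np.tanh(3.0 * h)         # label y = g*(h*) with g* = tanh(3 .)

    return f_star, w_star, a_star

def kappa(n, d):
    """Sample-complexity exponent kappa = log n / log d used on the x-axes."""
    return np.log(n) / np.log(d)
```

For example, with `d = 64` the latent dimension is 8, and a dataset of `n = 512` samples sits exactly at `kappa(512, 64) = 1.5`, the transition point highlighted in Figures 4 and 5.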

Theorems & Definitions (58)

  • Theorem 1: Informal
  • Theorem 2
  • Definition 1: Compositional Information Exponent
  • Definition 2
  • Theorem 3
  • Definition 3
  • Proposition 1
  • Proposition 2
  • Lemma 1: Non-asymptotic CLT -bound
  • Definition 4: Hermite decomposition
  • ...and 48 more