
Deep Learning of Compositional Targets with Hierarchical Spectral Methods

Hugo Tabanelli, Yatin Dandi, Luca Pesce, Florent Krzakala

TL;DR

This work addresses why depth provides a computational advantage for learning structured, high-dimensional targets. It introduces a hierarchical spectral framework that recovers intermediate representations layer by layer in a Gaussian setting, replacing gradient-based training with explicit spectral estimators built from Hermite moments. The main results show a sharp sample complexity separation: a three-layer hierarchical estimator can recover the latent features with $n=O(d^{k+\varepsilon})$ samples, whereas shallow kernel methods require $n=O(d^{\text{deg}})$ samples. The hierarchical approach is therefore more efficient than learning the high-degree polynomial in a single shot. Gaussian equivalence principles underpin the analysis, enabling precise control over spectral gaps and asymptotic Gaussian behavior at each layer. The approach clarifies how depth facilitates progressive reparameterization and modular learning of compositional targets, with potential implications for understanding real-world deep networks and guiding new spectral algorithms for hierarchical data.

Abstract

Why depth yields a genuine computational advantage over shallow methods remains a central open question in learning theory. We study this question in a controlled high-dimensional Gaussian setting, focusing on compositional target functions. We analyze their learnability using an explicit three-layer fitting model trained via layer-wise spectral estimators. Although the target is globally a high-degree polynomial, its compositional structure allows learning to proceed in stages: an intermediate representation reveals structure that is inaccessible at the input level. This reduces learning to simpler spectral estimation problems, well studied in the context of multi-index models, whereas any shallow estimator must resolve all components simultaneously. Our analysis relies on Gaussian universality, leading to sharp separations in sample complexity between two and three-layer learning strategies.
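The abstract refers to the spectral estimation problems studied for multi-index models. As a rough illustration of that building block only, the sketch below runs a second-order Hermite moment estimator on a toy multi-index target. It is not the paper's three-layer algorithm: the toy target $y = r^{-1/2}\sum_j \mathrm{He}_2(w_j \cdot x)$, the function names `hermite2_spectral_estimator` and `subspace_overlap`, and the values of `d`, `r`, and `n` are all assumptions chosen for the example.

```python
import numpy as np

def hermite2_spectral_estimator(X, y, r):
    """Top-r eigenvectors of the empirical matrix (1/n) sum_mu y_mu (x_mu x_mu^T - I).

    For standard Gaussian inputs, x x^T - I is the second Hermite tensor, so the
    expectation of this matrix is low rank and aligned with the hidden directions.
    """
    n, d = X.shape
    M = (X.T * y) @ X / n - np.mean(y) * np.eye(d)
    eigvals, eigvecs = np.linalg.eigh(M)
    # Sort by magnitude so that spikes of either sign come first.
    order = np.argsort(np.abs(eigvals))[::-1]
    return eigvecs[:, order[:r]], eigvals[order]

def subspace_overlap(W, U):
    """Normalized overlap in [0, 1] between row space of W (r x d) and column space of U (d x r)."""
    return np.linalg.norm(W @ U) ** 2 / W.shape[0]

# Toy experiment (all sizes are illustrative assumptions).
rng = np.random.default_rng(0)
d, r, n = 30, 2, 20000

# Hidden orthonormal directions, stored as the rows of W.
W = np.linalg.qr(rng.standard_normal((d, r)))[0].T

# Gaussian inputs and a toy target built from He_2(z) = z^2 - 1.
X = rng.standard_normal((n, d))
y = ((X @ W.T) ** 2 - 1).sum(axis=1) / np.sqrt(r)

U, spectrum = hermite2_spectral_estimator(X, y, r)
overlap = subspace_overlap(W, U)
print(f"subspace overlap: {overlap:.3f}")
print("largest |eigenvalues|:", np.round(np.abs(spectrum[:4]), 3))
```

In this toy setting the first `r` eigenvalues should stand out from the rest of the spectrum, which is the same bulk-plus-spikes picture that the paper's figures describe for its own matrices. The hierarchical method in the paper applies estimators of this kind layer by layer, and that step is not reproduced here.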

Paper Structure

This paper contains 36 sections, 8 theorems, 54 equations, 7 figures, and 1 algorithm.

Key Result

Theorem 3.1

Let $\hat{C}_k^{(1)}$ be as defined in Eq. eq:C-ell-k. Then, with high probability as $d, d_1 \rightarrow \infty$: where $\tilde{O}$ includes polylogarithmic factors. The $\sqrt{d_1}$ scaling accounts for the normalization $\left\lVert A^{(2)}\right\rVert_2 = \mathcal{O}(1/\sqrt{d_1})$.

Figures (7)

  • Figure 1: An illustration of the compositional target functions as defined in Section \ref{sec:model}
  • Figure 1: Hierarchical spectral learning
  • Figure 2: Learning with hierarchical spectral methods. Performance of the hierarchical estimator described in Algorithm \ref{alg:agnostic-recovery} when learning the target \ref{eq:main:simple_target} with an identity readout $g^\star(x)=x$. In this case, kernel and shallow methods require $O(d^4)$ samples. Left: Mean Squared Error (MSE) achieved by the label predictor $\{\hat{y}_\mu\}_{\mu=1}^n$ versus the normalized number of samples $\alpha$ for input dimensions $d \in \{80,100,120,140\}$. The latent feature dimension is fixed to $d^\epsilon = \sqrt{d}$. The MSE drops significantly at the theoretically predicted threshold $n = \mathcal{O}(d^{k+\epsilon}) = \mathcal{O}(d^{2.5})$, in agreement with Theorem \ref{thm:matrix_conc_2}. Center: Overlap between the learned representations $\{\widehat{h}^{(1)}_\mu\}_{\mu = 1}^n$ and the ground truth (details in Appendix \ref{sec:app:numerics}). Like the MSE, the overlap grows significantly at the predicted threshold $d^{2.5}$. Right: Spectrum of the second-order matrix $\hat{C}^{(1)}_2$ in Eq. \ref{eq:main:chat1} for fixed $d = 100$ and $\alpha = 3$. The eigenvalue density shows a bulk (noise) and $d^\epsilon = 10$ spikes (signal) clearly separated from it, as predicted by the theory.
  • Figure 3: On the role of $d^\epsilon$. Performance of the hierarchical estimator described in Algorithm \ref{alg:agnostic-recovery} when learning a modification of the target \ref{eq:main:simple_target} used in Fig. \ref{fig:main_fig_subplots}. We consider $\epsilon = 1$, so the number of spikes to learn equals the ambient dimension $d$. The mean squared error (MSE) and feature overlap are plotted versus the normalized number of samples $\alpha$ for input dimensions $d \in \{40,80,100,120\}$. The spectrum is shown for $d=120$.
  • Figure 4: On the role of $g^\star$. Performance of the hierarchical estimator described in Algorithm \ref{alg:agnostic-recovery} when learning a modified version of the target \ref{eq:main:simple_target}, as in Fig. \ref{fig:main_fig_subplots}. We consider the nonlinearity $g^\star = \tanh$. Introducing this additional nonlinearity does not alter the qualitative behavior of the method: once the first-layer features ${\bf{h}}^{(1)}$ are learned, estimating $g^\star$ reduces to a one-dimensional regression problem (see Algorithm \ref{alg:agnostic-recovery}). The mean squared error (MSE) and feature overlap are shown as functions of the normalized sample size $\alpha$, for input dimensions $d \in \{40,80,100,120\}$. The spectrum is shown for $d=120$.
  • ...and 2 more figures
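
To give a sense of the scales quoted in the Figure 2 caption, the arithmetic below plugs in $d = 100$ with $k = 2$ and $\epsilon = 1/2$, the values implied by $d^{k+\epsilon} = d^{2.5}$ and $d^\epsilon = \sqrt{d}$. Constants and polylogarithmic factors are ignored, so these are orders of magnitude only and not predicted sample counts.

$$
d^{\epsilon} = 100^{1/2} = 10, \qquad d^{k+\epsilon} = 100^{2.5} = 10^{5}, \qquad d^{4} = 100^{4} = 10^{8}, \qquad \frac{d^{4}}{d^{k+\epsilon}} = d^{1.5} = 10^{3}.
$$

Under these assumptions the hierarchical threshold sits about three orders of magnitude below the shallow one at $d = 100$, and the gap widens as $d^{1.5}$ with growing dimension.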

Theorems & Definitions (9)

  • Theorem 3.1
  • Remark 3.1: Conjectured Extension
  • Theorem 3.2
  • Lemma 4.1
  • Lemma 4.2: Lemma 2 in wang2023learning
  • Lemma A.1
  • Lemma A.2: Lemma 2 in nualart2005central
  • Lemma A.3
  • Lemma A.4: Lemma F.4 in wen2025does