Optimal scaling laws in learning hierarchical multi-index models

Leonardo Defilippis, Florent Krzakala, Bruno Loureiro, Antoine Maillard

TL;DR

This work provides a principled theory of how shallow networks learn hierarchical multi-index targets in a genuinely feature-learning regime. It derives sharp information-theoretic scaling laws for subspace recovery and shows that a target-agnostic spectral estimator achieves them, revealing the sequential emergence of features through a cascade of phase transitions. It then shows that a two-stage neural network training procedure (spectral initialization followed by ridge readout) attains the same Bayes-optimal rates in excess risk, tying representation discovery directly to predictive performance. The results connect universal scaling behavior with spectral structure and progressive concept learning, offering a rigorous benchmark for understanding learning dynamics in neural networks and guiding future work on SGD-based training and more realistic data models.

Abstract

In this work, we provide a sharp theory of scaling laws for two-layer neural networks trained on a class of hierarchical multi-index targets, in a genuinely representation-limited regime. We derive exact information-theoretic scaling laws for subspace recovery and prediction error, revealing how the hierarchical features of the target are sequentially learned through a cascade of phase transitions. We further show that these optimal rates are achieved by a simple, target-agnostic spectral estimator, which can be interpreted as the small learning-rate limit of gradient descent on the first-layer weights. Once an adapted representation is identified, the readout can be learned statistically optimally, using an efficient procedure. As a consequence, we provide a unified and rigorous explanation of scaling laws, plateau phenomena, and spectral structure in shallow neural networks trained on such hierarchical targets.

Paper Structure

This paper contains 17 sections, 8 theorems, 105 equations, 3 figures, and 1 algorithm.

Key Result

Theorem 3.2

In the setting of Definition 2.1 and Definition \ref{def:data_setting}, under Assumption \ref{assumption:gen_exp}, for $\alpha, m_\star \gg 1$, the Bayes-optimal mean-squared error satisfies […]. Moreover, the $k$-critical threshold of the Bayes estimator satisfies […], where $x_+ \coloneqq \max(0, x)$.

Figures (3)

  • Figure 1: Weighted mean-squared error ${\rm MSE}_\gamma$ (see Def. 2.7) of the spectral estimator of Def. 3.3 with preprocessing function ${\mathcal{T}}(y)=y/(1+|y|)$, averaged over $70$ instances. The target is given by the hierarchical multi-index model of Def. 2.1, with $g_k(z)=g(z)$ for all $k$, where $g$ is given in the legend, and $a_k^\star \propto k^{-\gamma}$, $\gamma = 1.3$. The covariate dimension is $d = 1000$ and the feature space dimension is $m_\star = 10$.
  • Figure 2: Empirical spectral density of the matrix ${\boldsymbol T}$ defined in eq. \ref{eq:def:spectral_method_matrix}, with preprocessing function ${\mathcal{T}}(y)=y/(1+|y|)$, at different sample complexities, highlighting the sequential emergence of concepts as the sample size increases. The target is given by the hierarchical multi-index model of Def. 2.1, with $g_k(z) = \frac{1}{2}{\rm He}_2(z)+\frac{1}{2\cdot4!}{\rm He}_4(z)$ for all $k$ and $a_k\propto k^{-\gamma}$, $\gamma = 1.3$. The covariate dimension is $d = 1000$ and the feature space dimension is $m_\star = 20$. (top left) $\alpha = 5$, (top right) $\alpha = 164$, (bottom left) $\alpha = 611$, (bottom right) $\alpha = 1638$.
  • Figure 3: Empirical spectrum of ${\boldsymbol T}$ (eq. \ref{eq:def:spectral_method_matrix}) with preprocessing ${\mathcal{T}}(y)=y/(1+|y|)$, for a hierarchical multi-index model with $g_k(z)=\frac{1}{2}{\rm He}_2(z)+\frac{1}{2\cdot4!}{\rm He}_4(z)$ and $\gamma = 1.3$. The covariate dimension is $d = 1000$ and the feature space dimension is $m_\star = 20$. The figure illustrates the change in scale of the eigenvalue gaps, from the informative spikes ($\Theta_d(1)$) to the uninformative bulk ($o_d(1)$). This behavior forms the basis of the selection method described in Def. 3.3.
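
The spectral pipeline behind these figures can be illustrated with a small, self-contained sketch. It is not the paper's estimator: the exact form of ${\boldsymbol T}$ and of the target are not reproduced in this summary, so the code assumes the common choice ${\boldsymbol T} = \frac{1}{n}\sum_i {\mathcal{T}}(y_i)\,({\boldsymbol x}_i {\boldsymbol x}_i^\top - {\boldsymbol I})$ and an additive target $y = \sum_k a_k\,{\rm He}_2(\langle {\boldsymbol w}_k, {\boldsymbol x}\rangle)$ with ${\boldsymbol w}_k$ taken to be coordinate axes. The helper names (`preprocess`, `sample_data`, `spectral_matrix`, `top_eigenvector`) are hypothetical, the sizes are far smaller than $d = 1000$, and only the leading direction is extracted (by power iteration) rather than a gap-based selection of several spikes.

```python
import math
import random


def preprocess(y):
    """Bounded preprocessing T(y) = y / (1 + |y|), as in the figure captions."""
    return y / (1.0 + abs(y))


def he2(z):
    """Second probabilists' Hermite polynomial, He_2(z) = z^2 - 1."""
    return z * z - 1.0


def sample_data(n, d, coeffs, rng):
    """Gaussian covariates with labels y = sum_k a_k * He_2(x_k).

    Assumption: the hidden directions are the first len(coeffs) coordinate axes.
    """
    xs, ys = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        xs.append(x)
        ys.append(sum(a * he2(x[k]) for k, a in enumerate(coeffs)))
    return xs, ys


def spectral_matrix(xs, ys):
    """Assumed form: T = (1/n) * sum_i T(y_i) * (x_i x_i^T - I)."""
    n, d = len(xs), len(xs[0])
    mat = [[0.0] * d for _ in range(d)]
    for x, y in zip(xs, ys):
        t = preprocess(y) / n
        for i in range(d):
            row, txi = mat[i], t * x[i]
            for j in range(d):
                row[j] += txi * x[j]
            row[i] -= t  # subtract the identity contribution
    return mat


def top_eigenvector(mat, iters=200, seed=1):
    """Power iteration: unit vector for the largest-magnitude eigenvalue,
    returned together with its Rayleigh quotient."""
    rng = random.Random(seed)
    d = len(mat)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    mv = [sum(mat[i][j] * v[j] for j in range(d)) for i in range(d)]
    return v, sum(a * b for a, b in zip(v, mv))


gamma = 1.3
coeffs = [k ** (-gamma) for k in (1, 2)]  # a_k proportional to k^(-gamma)
rng = random.Random(0)
xs, ys = sample_data(n=4000, d=10, coeffs=coeffs, rng=rng)
T = spectral_matrix(xs, ys)
v, lam = top_eigenvector(T)
overlap = abs(v[0])  # alignment with the strongest hidden direction
print(f"top eigenvalue {lam:.3f}, overlap with first direction {overlap:.3f}")
```

Under these assumptions the direction with the largest coefficient $a_1$ carries the largest spike, so it is the one the leading eigenvector aligns with. Weaker directions would need more samples to separate from the bulk, which is the qualitative picture in Figure 2. A practical version would use a linear algebra library for a full eigendecomposition.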

Theorems & Definitions (24)

  • Definition 2.1: Hierarchical multi-index model
  • Definition 2.2
  • Remark 2.4
  • Definition 2.5: Matrix-MSE
  • Definition 2.6: $k$-critical threshold
  • Definition 2.7: Weighted MSE
  • Definition 3.1: Optimal MSE
  • Theorem 3.2: Optimal scaling-laws
  • Definition 3.3: Spectral estimator
  • Remark 3.5: GD interpretation of spectral method
  • ...and 14 more