The Nuclear Route: Sharp Asymptotics of ERM in Overparameterized Quadratic Networks

Vittorio Erba, Emanuele Troiani, Lenka Zdeborová, Florent Krzakala

TL;DR

The paper analyzes empirical risk minimization for over-parameterized two-layer neural networks with quadratic activations trained on Gaussian data, revealing that L2 regularization induces a nuclear-norm penalty in an equivalent PSD matrix-estimation problem. Using approximate message passing and Gaussian universality, it derives sharp closed-form limits for training/test errors and the spectrum of the learned weights in the high-dimensional regime, showing that learnability depends on the target’s spectral width κ^* and the extent of over-parameterization κ. Key contributions include exact interpolation and strong-recovery thresholds, a detailed learning-curve description, and a characterization of how over-parameterization can preserve performance even when the width greatly exceeds data requirements. The findings illuminate the deep connection between low-rank matrix sensing and non-linear learning in quadratic networks, bridging spin-glass intuition, convex optimization, and matrix factorization theory with precise asymptotic results.

Abstract

We study the high-dimensional asymptotics of empirical risk minimization (ERM) in over-parametrized two-layer neural networks with quadratic activations trained on synthetic data. We derive sharp asymptotics for both training and test errors by mapping the $\ell_2$-regularized learning problem to a convex matrix sensing task with nuclear norm penalization. This reveals that capacity control in such networks emerges from a low-rank structure in the learned feature maps. Our results characterize the global minima of the loss and yield precise generalization thresholds, showing how the width of the target function governs learnability. This analysis bridges and extends ideas from spin-glass methods, matrix factorization, and convex optimization and emphasizes the deep link between low-rank matrix sensing and learning in quadratic neural networks.
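The mapping from $\ell_2$ regularization to a nuclear-norm penalty can be checked numerically on a toy instance. The sketch below assumes a hypothetical normalization, $f_W(x) = \frac{1}{m}\sum_{k=1}^m (w_k \cdot x)^2 = x^\top S x$ with $S = W^\top W / m$; the paper's own scaling may differ. Under this assumption $S$ is PSD, so its nuclear norm equals its trace and $\|W\|_F^2 = m \|S\|_*$. The function names `quadratic_network` and `effective_matrix` are illustrative only.

```python
import numpy as np

def quadratic_network(W, x):
    """Two-layer net with quadratic activation, averaged over the m hidden units.
    The 1/m normalization is an assumption made for this sketch."""
    return np.mean((W @ x) ** 2)

def effective_matrix(W):
    """PSD matrix S = W^T W / m such that the network output equals x^T S x."""
    return W.T @ W / W.shape[0]

rng = np.random.default_rng(0)
m, d = 12, 5                      # over-parameterized toy sizes: m >= d
W = rng.standard_normal((m, d))   # first-layer weights, one row per hidden unit
x = rng.standard_normal(d)        # a single Gaussian input

S = effective_matrix(W)
net_out = quadratic_network(W, x)            # network output
quad_form = x @ S @ x                        # same output as a quadratic form in S
l2_penalty = np.sum(W ** 2)                  # squared Frobenius norm of the weights
nuclear_norm = np.linalg.norm(S, ord="nuc")  # sum of singular values of S
min_eig = np.linalg.eigvalsh(S).min()        # S is PSD, so this is non-negative

print(net_out, quad_form)
print(l2_penalty, m * nuclear_norm)
```

The two printed pairs should agree up to floating-point error: the network depends on $W$ only through $S$, and the weight-decay term is proportional to the nuclear norm of $S$.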

Paper Structure

This paper contains 44 sections, 7 theorems, 100 equations, 4 figures.

Key Result

Theorem 1

Consider the setting of Section \ref{sec:setting} with $\kappa \geq 1$. Define $\tilde{\lambda} = \sqrt{\kappa} \lambda$ and $\mu^*_\delta = \mu^* \boxplus \mu_{\rm s.c., \delta}$, where $\boxplus$ is the free convolution and $\mu_{\rm s.c., \delta}(x) = \sqrt{4 \delta^2 - x^2}/(2\pi \delta^2)$ is the semicircle density. Then, for all values of $\alpha, \kappa^*, \lambda > 0$, $\Delta \geq 0$ and $\kappa \geq 1$ any gl
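As a sanity check on the semicircle density $\sqrt{4\delta^2 - x^2}/(2\pi\delta^2)$ used in the definition of $\mu^*_\delta$, the sketch below compares it with the eigenvalues of a symmetric Gaussian (Wigner) matrix scaled by $\delta$. The matrix construction and the helper names `semicircle_density` and `wigner_matrix` are assumptions made for this illustration, not taken from the paper. The density is supported on $[-2\delta, 2\delta]$, integrates to one, and has variance $\delta^2$. In the large-$d$ limit, adding such an independent Wigner matrix to a matrix with spectrum $\mu^*$ gives a spectrum described by the free convolution $\mu^* \boxplus \mu_{\rm s.c., \delta}$.

```python
import numpy as np

def semicircle_density(x, delta):
    """Semicircle density sqrt(4 delta^2 - x^2) / (2 pi delta^2), zero outside [-2 delta, 2 delta]."""
    x = np.asarray(x, dtype=float)
    inside = np.clip(4 * delta**2 - x**2, 0.0, None)
    return np.sqrt(inside) / (2 * np.pi * delta**2)

def wigner_matrix(d, delta, rng):
    """Symmetric Gaussian matrix whose spectrum approaches the semicircle of scale delta."""
    G = rng.standard_normal((d, d))
    return delta * (G + G.T) / np.sqrt(2 * d)

rng = np.random.default_rng(0)
d, delta = 400, 0.5
eigs = np.linalg.eigvalsh(wigner_matrix(d, delta, rng))

# Riemann sums over the support of the density
grid = np.linspace(-2 * delta, 2 * delta, 20001)
dx = grid[1] - grid[0]
dens = semicircle_density(grid, delta)
mass = np.sum(dens) * dx                       # total mass of the density
second_moment_theory = np.sum(grid**2 * dens) * dx
second_moment_empirical = np.mean(eigs**2)     # finite-d estimate from the eigenvalues

print(mass, second_moment_theory, second_moment_empirical, eigs.max())
```

At this size the empirical second moment and the largest eigenvalue are only approximately $\delta^2$ and $2\delta$; the agreement improves as $d$ grows.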

Figures (4)

  • Figure 1: Left: Test error of simulations of vanilla GD (crosses; error bars are the standard deviation over 16 realizations of the target/training set at $d=300$) compared with the results of Theorem \ref{res:over-parametrized-stud} (lines) as a function of the number of samples $n = \alpha d^2$, in the noiseless case $\Delta = 0$, $\kappa^*=0.2$. We observe a perfect match, particularly striking in the regime of small test error. The purple line is the Bayes-optimal performance \cite{maillard_bayes-optimal_2024}. Right: Test error of simulations of GD run with LBFGS on \ref{eq:erm} (yellow dots, $d=300$) and of a convex solver run on the equivalent convex matrix problem \ref{eq:erm-eff} (blue dots, $d=50$; purple dots, $d=100$), for $\Delta = 0.5$, $\kappa^*=0.2$, and $\lambda = 0.02$, as a function of the number of samples $n = \alpha d^2$. Error bars are the standard deviation over 16 realizations of the target/training set, compared with the result of Theorem \ref{res:over-parametrized-stud} (gray line).
  • Figure 2: Spectra of the singular values of $\hat{W}/\sqrt[4]{md}$ for $\kappa^*=0.2$, $\Delta=0.5$, $\lambda = 0.02$ and several values of $\alpha$. The red line is the singular value density of the target, $2x\mu^*(x^2)$; the blue line is the density predicted by \ref{eq:res-spectrum}. The gray histogram is computed on the singular values of $16$ runs of LBFGS on experiments with $d=400$.
  • Figure 3: (Left) The test error of any global minimum of \ref{eq:erm} (Theorem \ref{res:over-parametrized-stud}) in the noiseless case $\Delta = 0$, $\kappa^*=0.2$ for finite regularization $\lambda = 0.4$ (blue line), in the limit $\lambda \to 0^+$ (yellow line) and for optimal regularization (dashed line). We compare with the Bayes-optimal performance \cite{maillard_bayes-optimal_2024} (purple line), and highlight the strong recovery threshold (vertical gray line, see Corollary \ref{res:strong}). (Center, Right) The test and train loss \ref{eq:erm} in the noisy case $\Delta = 0.5$, $\kappa^*=0.2$ for several values of the regularization $\lambda$ (solid lines), $\lambda \to 0^+$ (yellow line) and for optimal regularization (dashed line). We highlight the region of sample ratio $\alpha$ where the non-regularized training loss goes to zero (before the vertical gray line, from Result \ref{res:interpolation}), which coincides with the development of a cusp in the test error as $\lambda$ decreases.
  • Figure 4: (Left) Interpolation threshold $\alpha_{\rm inter}(\kappa^*, \Delta)$ as a function of $\kappa^*$ for several values of label noise $\Delta$ (Result \ref{res:interpolation}). Notice the convergence to the $1/4$ random-label-fitting threshold for very narrow targets $\kappa^* \ll 1$ and large label noise $\Delta \gg 1$. (Center) Comparison between the interpolation threshold (Result \ref{res:interpolation}, $\Delta = 0$) and the strong recovery threshold (Corollary \ref{res:strong}) of the global minima of \ref{eq:erm}, with the BO strong recovery threshold \cite{maillard_bayes-optimal_2024}. Minimal regularization interpolators of \ref{eq:erm} reach perfect recovery well before the interpolator set shrinks to a singleton on the target weights (the effect is more pronounced for very small ranks of the target function, $\kappa^* \ll 1$). (Right) The test error of any global minimum of \ref{eq:erm} in the limit $\kappa^* \to 0$ (Result \ref{res:small-teacher}) for several values of regularization $\lambda = {\bar{\lambda}} / \sqrt{\kappa^*}$ and label noise $\Delta$, compared with the Bayes-optimal performance \cite{maillard_bayes-optimal_2024}.

Theorems & Definitions (7)

  • Theorem 1: Asymptotics of ERM \ref{eq:erm}, informal
  • Corollary 1: Strong recovery threshold
  • Theorem 2: Gaussian Universality of the loss, from \cite{maillard_bayes-optimal_2024, xu_fundamental_2025}
  • Theorem 3: Universality of the overlaps \cite{dandi2023universality}, informal
  • Theorem 4: Theorem 1 in \cite{berthierStateEvolutionApproximate2020}, informal
  • Corollary 2: Fixed point initialization
  • Theorem 5: Convergence of GAMP, Lemma 7 from \cite{loureiro2021learning}