High-dimensional learning of narrow neural networks

Hugo Cui

TL;DR

This article surveys the high-dimensional learning of neural networks with a finite number of hidden units through the sequence multi-index model (SM), unifying analyses across MLPs, autoencoders, and attention-based architectures. It leverages the replica method and generalized AMP (GAMP) to derive tight, closed-form descriptions of training loss, test error, and learned representations in the proportional limit where data dimension $d$ and sample size $n$ grow with fixed ratio $\alpha$. By mapping various architectures and tasks to SM, the work provides a cohesive framework that connects phase transitions, Bayes-optimal behavior under regularization, and semantic learning to algorithmic fixed points of GAMP and gradient descent. The insights illuminate both theoretical underpinnings and practical implications for neural-network design, data modeling, and scalable inference in high dimensions. The perspectives highlight open questions on data structure, width scalings, and dynamics, guiding future research at the intersection of statistical physics and ML theory.

Abstract

Recent years have been marked by the fast-paced diversification and increasing ubiquity of machine learning applications. Yet a firm theoretical understanding of the surprising efficiency with which neural networks learn from high-dimensional data remains largely elusive. In this endeavour, analyses inspired by statistical physics have proven instrumental, enabling the tight asymptotic characterization of the learning of neural networks in high dimensions, for a broad class of solvable models. This manuscript reviews the tools and ideas underlying recent progress in this line of work. We introduce a generic model, the sequence multi-index model, which encompasses numerous previously studied models as special instances. This unified framework covers a broad class of machine learning architectures with a finite number of hidden units (including multi-layer perceptrons, autoencoders, and attention mechanisms) and tasks (including (un)supervised learning, denoising, and contrastive learning), in the limit of large data dimension and comparably large number of samples. We explicate in full detail the analysis of the learning of sequence multi-index models, using statistical physics techniques such as the replica method and approximate message-passing algorithms. This manuscript thus provides a unified presentation of analyses reported in several previous works, and a detailed overview of central techniques in the field of statistical physics of machine learning. This review should be a useful primer for machine learning theoreticians curious about statistical physics approaches; it should also be of value to statistical physicists interested in the transfer of such ideas to the study of neural networks.

Paper Structure (79 sections, 141 equations, 5 figures, 6 algorithms)


Basic concepts in machine learning
Why machine learning theory?
The machine learning pipeline
Some machine learning models
Off-the-shelf feature maps
No feature map
Kernel feature maps
Random Features (RF)
Tunable feature maps
Multi-Layer Perceptron (MLP)
Autoencoders (AE)
Transformers
Challenges and open questions
Statistical physics of neural networks
Statistical physics in the ML researchscape
...and 64 more sections

Figures (5)

Figure 1: Graphical representation of some existing or possible models in asymptotic ML theories for MLPs (top), AEs (middle), or attention mechanisms (bottom). Each column corresponds to a different asymptotic limit for the NN architecture. From left to right: single hidden unit models, models with a finite number of hidden units, infinite-width models, and extensive-width models. Framed are the narrow architectures with $r=\Theta_d(1)$ hidden neurons, which are reviewed and analyzed in a unified fashion in the present review.
Figure 2: Special cases of interest of the sequence multi-index model (\ref{eq:data_distrib}, \ref{eq:ERM}), classified by the associated sequence length $L$. For $L=1$, GLMs and two-layer neural networks \cite{aubin2020generalization, cornacchia2023learning, mignacco2020role, loureiro2021learning2}, discussed in \ref{subsec:perceptron}. For $L=2$, AEs \cite{cui2023high}, discussed in \ref{subsec:DAE}, siamese networks, discussed in \ref{subsec:siamese}, and RFs \cite{Gerace2022, loureiro2021learning, schroder2023deterministic}, discussed in \ref{subsec:RF}. For $L\ge 2$, attention models \cite{cui2024phase}, discussed in \ref{subsec:attention}.
Figure 3: Graphical model associated to the measure $\mathbb{P}_\beta$ (\ref{eq:Z}). We used the shorthands $h_\mu(\boldsymbol{w})\equiv \exp(\beta \ell(\boldsymbol{x}^\mu\boldsymbol{w}_\star/\sqrt{d},\boldsymbol{x}^\mu \boldsymbol{w}/\sqrt{d},\boldsymbol{w}^\top \boldsymbol{w}/d, c^\mu))$ and $g(w_i)=\exp(\beta \lambda/2\,\lVert w_i\rVert^2)$. Iterative schemes such as GAMP (\ref{alg:GAMP}) \cite{rangan2011generalized, rangan2016fixed, javanmard2013state} can be used to estimate marginals from such distributions.
Figure 4: Example of learning curves obtained from the theoretical characterization \ref{eq:intro:replica_SP_repeat} of section \ref{sec: Derivation} (solid lines), contrasted with numerical experiments (dots), in dimensions $d=700, 1000$, of the corresponding networks trained with the PyTorch \cite{paszke2019pytorch} implementation of the Adam \cite{kingma2014adam} optimizer. (left) Reproduced from \cite{cui2023high}. Denoising test error achieved by a DAE \ref{eq:AE:model}, minus that achieved by a simple linear baseline $f_b(\boldsymbol{x})=b\boldsymbol{x}$, as a function of the variance of the corrupting noise. (right) Reproduced from \cite{cui2024phase}. Test error achieved by a single-layer attention model \ref{eq:Attention:student}. A first-order phase transition occurs at sample complexity $\alpha=\alpha_c$, signalling that the model learns a qualitatively new algorithmic mechanism. All details can be found in the original works and are not exhaustively reported here for conciseness.
Figure 5: (left) Statistical physics studies typically leverage stylized data assumptions, such as Gaussian densities (a) or Gaussian mixture densities (b). In sharp contrast, real data ((d) represents a t-SNE \cite{van2008visualizing} visualization of the MNIST \cite{lecun1998gradient} train set) displays much more intricate structure. A strategy employed in e.g. \cite{bordelon2021learning, loureiro2021learning, loureiro2021learning2, cui2023high, Refinetti2022TheDO} consists in considering surrogate analytical densities (c) with matching relevant statistics. This makes it possible to obtain theoretical predictions capturing the learning curves of real data experiments, like denoising FashionMNIST \cite{xiao2017} with a DAE (right). The plot is reproduced from \cite{cui2023high}, and represents the test error achieved by the DAE, minus that achieved by a simple baseline $f_b(\boldsymbol{x})=b\boldsymbol{x}$; see also the caption of Fig. \ref{fig:examples}.
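
The captions of Figures 4 and 5 report denoising errors relative to a linear rescaling baseline $f_b(\boldsymbol{x})=b\boldsymbol{x}$. As a rough illustration of what such a baseline computes, the sketch below fits the scalar $b$ by least squares on synthetic data. It is not the experimental setup of the original works: the isotropic Gaussian data, the additive corruption $\tilde{\boldsymbol{x}}=\boldsymbol{x}+\sqrt{\Delta}\,\boldsymbol{\xi}$, the sizes `n`, `d`, and the helper names `fit_linear_baseline` and `denoising_mse` are all assumptions made for this example.

```python
import numpy as np


def fit_linear_baseline(x_clean, x_noisy):
    """Scalar b minimizing the empirical mean of ||x_clean - b * x_noisy||^2.

    Setting the derivative with respect to b to zero gives
    b = <x_clean, x_noisy> / <x_noisy, x_noisy>.
    """
    return float(np.sum(x_clean * x_noisy) / np.sum(x_noisy * x_noisy))


def denoising_mse(x_clean, x_hat):
    """Mean squared reconstruction error per coordinate."""
    return float(np.mean((x_clean - x_hat) ** 2))


# Hypothetical sizes and noise level, chosen only for illustration.
rng = np.random.default_rng(0)
n, d = 2000, 100
delta = 0.5  # variance of the corrupting noise (assumed additive Gaussian)

# Assumed data model: isotropic Gaussian clean samples plus Gaussian noise.
x_train = rng.standard_normal((n, d))
x_train_noisy = x_train + np.sqrt(delta) * rng.standard_normal((n, d))
b = fit_linear_baseline(x_train, x_train_noisy)

# Evaluate the fitted baseline on fresh samples from the same model.
x_test = rng.standard_normal((n, d))
x_test_noisy = x_test + np.sqrt(delta) * rng.standard_normal((n, d))
baseline_error = denoising_mse(x_test, b * x_test_noisy)

print(f"fitted b = {b:.3f}, baseline test MSE = {baseline_error:.3f}")
```

Under these assumptions the population minimizer is $b=1/(1+\Delta)$, with per-coordinate error $\Delta/(1+\Delta)$, so the fitted values should land close to those numbers. A difference such as the one plotted in the figures would then be obtained by subtracting `baseline_error` from the test error of the trained network.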
