SGD learning on neural networks: leap complexity and saddle-to-saddle dynamics

Emmanuel Abbe; Enric Boix-Adsera; Theodor Misiakiewicz

TL;DR

This work investigates the time complexity of SGD learning on regular fully-connected neural networks trained on isotropic data with low latent dimensionality. It introduces the leap complexity to quantify how hierarchical a target function is and proves, for a Gaussian setting with a 2-layer network, that SGD learns a target in time scaling as $\tilde{\Theta}(d^{\max(\mathrm{Leap}(f),2)})$ up to poly(1/ε) factors, via a saddle-to-saddle sequential learning dynamic. It establishes CSQ lower bounds that match the proposed upper bounds, and frames SGD as implementing an adaptive curriculum that learns low-level features first and builds up to higher-order monomials. The results generalize prior leap-1 analyses and provide a rigorous connection between practical SGD dynamics and information-theoretic lower bounds, with experimental evidence and clear avenues for extension to broader architectures and data distributions.

Abstract

We investigate the time complexity of SGD learning on fully-connected neural networks with isotropic data. We put forward a complexity measure -- the leap -- which measures how "hierarchical" target functions are. For $d$-dimensional uniform Boolean or isotropic Gaussian data, our main conjecture states that the time complexity to learn a function $f$ with low-dimensional support is $\tilde{\Theta}(d^{\max(\mathrm{Leap}(f),2)})$. We prove a version of this conjecture for a class of functions on Gaussian isotropic data and 2-layer neural networks, under additional technical assumptions on how SGD is run. We show that the training sequentially learns the function support with a saddle-to-saddle dynamic. Our result departs from [Abbe et al. 2022] by going beyond leap 1 (merged-staircase functions), and by going beyond the mean-field and gradient flow approximations that prohibit the full complexity control obtained here. Finally, we note that this gives an SGD complexity for the full training trajectory that matches that of Correlational Statistical Query (CSQ) lower-bounds.
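To make the leap concrete, here is a minimal sketch for targets that are sums of monomials. It assumes the leap is the minimum, over orderings of the monomials, of the largest number of new coordinates any monomial adds to those already seen; this is our reading of Definition 1, not a transcription of it. The names `leap_schedule` and `leap` are hypothetical, and the sketch ignores the rotation-invariant variant relevant to Gaussian data. It orders the monomials greedily: always taking the monomial with the fewest new coordinates attains the minimum, since the first not-yet-taken monomial of any optimal ordering is always available at no greater cost.

```python
def leap_schedule(monomials):
    """Order monomials greedily by how few new coordinates each one adds.

    `monomials` is an iterable of coordinate-index collections, e.g. {1, 2, 3}
    for z_1 z_2 z_3. Returns a list of (sorted_indices, jump) pairs, where
    `jump` is the number of coordinates not covered by earlier monomials.
    """
    remaining = [frozenset(m) for m in monomials]
    learned = set()      # coordinates covered so far
    schedule = []
    while remaining:
        # Pick the monomial that introduces the fewest unseen coordinates.
        best = min(remaining, key=lambda s: len(s - learned))
        schedule.append((sorted(best), len(best - learned)))
        learned |= best
        remaining.remove(best)
    return schedule


def leap(monomials):
    """Largest jump along the greedy schedule (0 for an empty target)."""
    return max((jump for _, jump in leap_schedule(monomials)), default=0)


# Target from Figure 1: z_1 + z_1...z_5 + z_1...z_9 + z_1...z_14
fig1 = [range(1, 2), range(1, 6), range(1, 10), range(1, 15)]
print([jump for _, jump in leap_schedule(fig1)])  # [1, 4, 4, 5]
print(leap(fig1))                                 # 5
```

The successive jumps 1, 4, 4, 5 reproduce the leap sizes quoted in the Figure 1 caption, and the same function gives 3 and 4 for the targets of Figures 3 and 5.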

Paper Structure (44 sections, 25 theorems, 272 equations, 7 figures, 1 table, 1 algorithm)

Introduction
IID inputs and low-dimensional latent dimension.
The example of staircases.
The leap complexity
Summary of our contributions
Overview.
Formal results.
Related works
Lower bounds on learning leap functions
Learning leap functions with SGD on neural networks
Algorithm
Learning a single monomial
Learning multiple monomials
Discussion
Summary of contributions:
...and 29 more sections

Key Result

Proposition 1

Let $h_*$ be a degree-$D$ polynomial over the Boolean hypercube (resp., Gaussian measure). Then there are $c_{h_*}, \varepsilon_{h_*} > 0$, such that any linear method needs $c_{h_*} d^D$ samples to learn $f_*({\boldsymbol x}) = h_*({\boldsymbol M} {\boldsymbol x})$ to less than $\varepsilon_{h_*} >

Figures (7)

Figure 1: Test error versus the number of online-SGD steps to learn $h_*({\boldsymbol z}) = z_1 + z_1 z_2 \cdots z_5 + z_1 z_2 \cdots z_9 + z_1z_2 \cdots z_{14}$ in ambient dimension $d=100$ on the hypercube. We take $M = 300$ neurons with shifted sigmoid activation and train both layers at once with constant step size $0.4/d$. The SGD dynamics follows a saddle-to-saddle dynamic and sequentially picks up the support and monomials $z_1$ in roughly $d$ steps, $z_1z_2 \cdots z_5$ in $d^3$ steps (leap of size $4$), $z_1z_2 \cdots z_9$ in $d^3$ steps (leap of size $4$) and $z_1z_2 \cdots z_{14}$ in $d^4$ steps (leap of size $5$).
Figure 2: We train a 5-layer ResNet with fully-connected layers with SGD to learn the leap-3 function $h_*({\boldsymbol z}) = 2\cdot\prod_{i=1}^2 \tanh(z_i) + 5\cdot\prod_{i=1}^5 \tanh(z_i)$ with data ${\boldsymbol x} \sim {\sf N}(0,I_d)$ and $d = 50$. While our paper considers bounded-degree polynomials, the leap complexity, which drives the sequential alignment to the support, also applies to non-polynomial functions. In this case, the leap depends on the first non-zero monomials in the Hermite decomposition. For the $h_*$ considered in this plot, there is first a leap of size 2 to align with $x_1,x_2$, followed by a leap of size 3 to align with $x_3x_4x_5$. In the plot of test risk over time, we indeed see first a short saddle to align with $x_1, x_2$, followed by a quick decrease of the loss (corresponding to the neural network fitting $2\tanh(z_1)\tanh(z_2)$). This is followed by a plateau while SGD slowly picks up $x_3,x_4,x_5$ (saddle) and a sharp decrease in the loss when the neural network fits the remainder of $h_*$. We also plot the heatmap of the absolute value of the entries of ${\boldsymbol W}^{\top} {\boldsymbol W} \in \mathbb{R}^{d \times d}$, where ${\boldsymbol W}$ is the first-layer matrix after training. This shows that the first layer indeed picks up the relevant coordinates (first $5$ coordinates) in the support after training.
Figure 3: In (a)-(d) we show the evolution of the risk for training a 5-layer ResNet with fully-connected layers with SGD to learn the leap-3 function $h_*({\boldsymbol z}) = z_1 + z_1z_2z_3 + z_1z_2z_3z_4z_5z_6$ with binary hypercube data in ambient dimension $d = 50, 100, 200, 400$, respectively. Notice that the evolution of the risk follows a saddle-to-saddle dynamic. This dynamic becomes more salient as the ambient dimension increases and escaping the saddles dominates the SGD trajectory.
Figure 4: We consider training a 5-layer ResNet with fully-connected layers with SGD on covariate distribution ${\boldsymbol x} \sim {\sf N}(0,I_d)$ with $d = 500$. In (a) we show the risk from learning the leap-3 function $h_*({\boldsymbol z}) = {\rm He}_3(z_1)$, and in (b) we show the risk from learning the leap-1 function $h_*({\boldsymbol z}) = {\rm He}_1(z_1) + {\rm He}_3(z_1)$. Notice that the leap-3 task is much more difficult for SGD, and it gets stuck in a saddle where the loss plateaus. On the other hand, the ${\rm He}_1(z_1)$ term in the leap-1 task means that SGD is not stuck in a saddle.
Figure 5: A width-$1000$ 5-layer ResNet with ReLU activation trained with one-pass SGD with mini-batch size $100$ and step size $0.1$. The data is ${\boldsymbol x} \sim \{+1,-1\}^{d}$ for ambient dimension $d = 50$, and $h_*(z) = z_1 + z_1z_2z_3z_4 + z_1z_2z_3z_4z_5z_6z_7z_8$, which is a leap-4 function. We observe saddle-to-saddle dynamics, and the first layer picks up the relevant support iteratively.
...and 2 more figures

Theorems & Definitions (51)

Definition 1: Leap complexity
Conjecture 1
Remark 1
Proposition 1: Lower bound for linear methods; informal statement of Propositions \ref{prop:degree-linear-boolean} and \ref{prop:degree-linear-gaussian}
Proposition 2: Lower bound for CSQ methods; informal statement of Propositions \ref{prop:leap-csq-boolean} and \ref{prop:isoleap-csq-gaussian}
Remark 2
Theorem 1: First layer training, single monomial, sum of monomials
Corollary 1: Second layer training, single monomial
Theorem 2: First layer training
Corollary 2: Second layer training, sum of monomials; informal statement
...and 41 more
