Escaping mediocrity: how two-layer networks learn hard generalized linear models with SGD

Luca Arnaboldi; Florent Krzakala; Bruno Loureiro; Ludovic Stephan

TL;DR

This work analyzes one-pass SGD for a fully connected two-layer network learning a quadratic phase-retrieval target in high dimensions, revealing a landscape with many flat directions at initialization (mediocrity). By reducing SGD to a low-dimensional process over sufficient statistics and taking the $d\to\infty$ limit, the authors derive deterministic ODEs that exactly capture the dynamics and expose the exit-time behavior needed to escape mediocrity. They establish precise sample-complexity results $n=O(d\log d)$ and show that overparameterization improves convergence only by a constant factor; stochastic corrections to exit time are negligible, implying the deterministic limit suffices for this problem. The findings illuminate why certain hard-target regimes resist acceleration via width or stochasticity and provide practical guidance on optimal learning-rate choices, with code available for replication.
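The $d\log d$ scaling can be motivated by a short heuristic, given here as a sketch under stated assumptions rather than as the paper's exit-time formula. The symbols $m$ (overlap between a hidden unit and the target direction), $\lambda$ (an effective growth rate near initialization), $\kappa$ (a fixed exit threshold), and the time scaling $t=n/d$ are all introduced for illustration only.

```latex
% Heuristic: linearize the deterministic dynamics around the flat initialization.
% Assumptions: m(0) is of order d^{-1/2} for a random start on the sphere,
% and time is measured as t = n/d (samples per dimension).
\begin{align*}
  \frac{\mathrm{d}m}{\mathrm{d}t} &\approx \lambda\, m,
  \qquad m(0) \approx d^{-1/2}
  \quad\Longrightarrow\quad
  m(t) \approx d^{-1/2} e^{\lambda t}, \\
  m(t_\text{ext}) = \kappa
  \quad&\Longrightarrow\quad
  t_\text{ext} \approx \frac{1}{\lambda}\log\!\left(\kappa\sqrt{d}\right)
  = \frac{\log d}{2\lambda} + O(1), \\
  n_\text{ext} &= d\, t_\text{ext} = O(d \log d).
\end{align*}
```

In this picture, quantities such as the width or the learning rate can only enter through $\lambda$ and $\kappa$, which do not depend on $d$. That is consistent with the statement that overparameterization changes the prefactor but not the $d\log d$ rate.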

Abstract

This study explores the sample complexity for two-layer neural networks to learn a generalized linear target function under Stochastic Gradient Descent (SGD), focusing on the challenging regime where many flat directions are present at initialization. It is well-established that in this scenario $n=O(d \log d)$ samples are typically needed. However, we provide precise results concerning the pre-factors in high-dimensional contexts and for varying widths. Notably, our findings suggest that overparameterization can only enhance convergence by a constant factor within this problem class. These insights are grounded in the reduction of SGD dynamics to a stochastic process in lower dimensions, where escaping mediocrity equates to calculating an exit time. Yet, we demonstrate that a deterministic approximation of this process adequately represents the escape time, implying that the role of stochasticity may be minimal in this scenario.
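To make the setting concrete, the following minimal sketch simulates one-pass SGD and records the largest overlap between the hidden units and the target direction. It rests on several assumptions that the summary above does not fix: a quadratic activation, a second layer frozen at $1/p$, a learning rate scaled as `gamma / d`, and a per-step renormalization of each hidden unit (a spherical constraint). The function name `one_pass_sgd_overlaps` and all parameter names are hypothetical.

```python
import numpy as np


def one_pass_sgd_overlaps(d=200, p=4, steps=2000, gamma=0.5, seed=0):
    """One-pass SGD for a two-layer net on a phase-retrieval target.

    Assumed (illustrative) setup:
      target  y    = (w_star . x)^2, with ||w_star|| = 1
      network f(x) = (1/p) * sum_j (w_j . x)^2, second layer fixed
      loss    0.5 * (f(x) - y)^2, one fresh Gaussian sample per step
    Returns the final weights and max_j |m_j| after every step,
    where m_j = w_j . w_star is the overlap with the target.
    """
    rng = np.random.default_rng(seed)

    # Random unit-norm target direction.
    w_star = rng.standard_normal(d)
    w_star /= np.linalg.norm(w_star)

    # Random initialization on the sphere: overlaps are typically ~ 1/sqrt(d).
    W = rng.standard_normal((p, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)

    max_abs_m = np.empty(steps + 1)
    max_abs_m[0] = np.max(np.abs(W @ w_star))

    lr = gamma / d  # assumed learning-rate scaling with dimension
    for t in range(steps):
        x = rng.standard_normal(d)           # fresh sample (one-pass)
        pre = W @ x                           # pre-activations, shape (p,)
        err = np.mean(pre**2) - (w_star @ x) ** 2
        # Gradient of the squared loss with respect to each row w_j.
        grad = err * (2.0 / p) * pre[:, None] * x[None, :]
        W -= lr * grad
        # Project every hidden unit back onto the unit sphere.
        W /= np.linalg.norm(W, axis=1, keepdims=True)
        max_abs_m[t + 1] = np.max(np.abs(W @ w_star))

    return W, max_abs_m
```

Plotting `max_abs_m` against `np.arange(steps + 1) / d` gives a curve in the same spirit as the overlap trajectories described in the figure captions. Whether and when the overlap leaves its initial plateau depends on `gamma`, `p`, and `d`, so the defaults here are placeholders, not tuned values.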

Paper Structure (40 sections, 142 equations, 6 figures)

This paper contains 40 sections, 142 equations, 6 figures.

Introduction
  Summary of results
  Further related work
High-dimensional limit of SGD
  Sufficient statistics
  High-dimensional limit
  Initialization and mediocrity
  Spherical constraint
Escaping mediocrity in the well-specified scenario
Population risk landscape
Exit time from deterministic limit
Does stochasticity matter?
The role of width
Training the second layer
Conclusion
...and 25 more sections

Figures (6)

Figure 1: Multiple runs of the simulated SGD and the numerically integrated SDE, always starting from the same initial condition, with $d=3000$. All the $t_\text{ext}$ values shown are obtained by numerically solving \ref{eq:exit_time_implicit_equation}. The SDE captures the variance that the ODE does not exhibit, but the $t_\text{ext}$ values do not change considerably.
Figure 2: Ratio between the $t_\text{ext}$ measured in simulations and the corresponding analytical formula (square = annealed, circle = quenched). We average over many initial conditions, for different values of $p$. The ratio $\gamma/p$ is kept constant across simulations.
Figure 3: $p=20$ (left), $p=50$ (right), $d = 1000$. Comparison of the growth of $\max{m}$ throughout the learning process when the second layer is fixed (blue) and when it is trained (green). The dynamics differ far from the starting point, but close to the exit point the two processes behave the same way, $t_\text{ext}$ included.
Figure 4: Comparison of ODE integration and many SGD runs for $p=5$ (left) and $p=20$ (right). Both experiments use $d=5000$.
Figure 5: Multiple runs of the simulated SGD and the numerically integrated SDE, always starting from the same initial condition, with $d=3000$. All the $t_\text{ext}$ values shown are obtained by numerically solving \ref{eq:exit_time_implicit_equation}. The SDE captures the variance that the ODE does not exhibit, but the $t_\text{ext}$ values do not change considerably.
...and 1 more figure
