Escaping mediocrity: how two-layer networks learn hard generalized linear models with SGD
Luca Arnaboldi, Florent Krzakala, Bruno Loureiro, Ludovic Stephan
TL;DR
This work analyzes one-pass SGD for a fully connected two-layer network learning a quadratic phase-retrieval target in high dimensions, a setting where the loss landscape has many flat directions at initialization (mediocrity). By reducing SGD to a low-dimensional process over sufficient statistics and taking the $d\to\infty$ limit, the authors derive deterministic ODEs that capture the dynamics and characterize the exit time needed to escape mediocrity. They give precise prefactors for the known $n=O(d\log d)$ sample complexity and show that overparameterization improves convergence only by a constant factor. Stochastic corrections to the exit time are found to be negligible, so the deterministic limit suffices for this problem. The findings illuminate why this class of hard targets resists acceleration via width or stochasticity and provide practical guidance on learning-rate choices, with code available for replication.
Abstract
This study explores the sample complexity for two-layer neural networks to learn a generalized linear target function under Stochastic Gradient Descent (SGD), focusing on the challenging regime where many flat directions are present at initialization. It is well established that $n=O(d \log d)$ samples are typically needed in this scenario. We sharpen this picture with precise results on the prefactors, in high dimensions and for varying widths. Notably, our findings suggest that overparameterization can only enhance convergence by a constant factor within this problem class. These insights are grounded in the reduction of SGD dynamics to a stochastic process in lower dimensions, where escaping mediocrity equates to calculating an exit time. We further demonstrate that a deterministic approximation of this process adequately represents the escape time, implying that the role of stochasticity may be minimal in this scenario.
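To make the setting concrete, the following is a minimal sketch of one-pass SGD on a phase-retrieval target, tracking the cosine overlap between each hidden unit and the teacher direction. It is not the authors' released code. The quadratic student activation, the fixed $1/p$ output average, the unit-norm teacher, the `lr / d` step-size scaling, and all function and parameter names (`one_pass_sgd`, `record_every`, and so on) are illustrative assumptions.

```python
import numpy as np


def one_pass_sgd(d=100, p=4, lr=0.02, n_steps=6000, record_every=100, seed=0):
    """One-pass SGD for a width-p student on the target y = (w_star . x)^2.

    Returns an array of shape (n_steps // record_every + 1, p) holding the
    cosine overlap of every hidden unit with the teacher at each record time.
    All scalings here are illustrative choices, not the paper's exact setup.
    """
    rng = np.random.default_rng(seed)

    # Teacher direction with unit norm (assumed normalization).
    w_star = rng.standard_normal(d)
    w_star /= np.linalg.norm(w_star)

    # Random initialization: rows have norm close to 1 and overlap with
    # w_star of order 1/sqrt(d), i.e. the student starts near "mediocrity".
    W = rng.standard_normal((p, d)) / np.sqrt(d)

    history = []
    for step in range(n_steps + 1):
        if step % record_every == 0:
            cos = (W @ w_star) / np.linalg.norm(W, axis=1)
            history.append(cos.copy())

        # One-pass: a fresh Gaussian sample at every step, never reused.
        x = rng.standard_normal(d)
        y = (w_star @ x) ** 2

        lam = W @ x                      # student pre-activations
        err = np.mean(lam ** 2) - y      # prediction error of the student

        # Gradient step on 0.5 * err^2; the 1/p from the output average is
        # absorbed into the learning rate (an assumption of this sketch).
        W -= (lr / d) * err * 2.0 * lam[:, None] * x[None, :]

    return np.array(history)


if __name__ == "__main__":
    overlaps = one_pass_sgd()
    print("initial |overlap|:", np.abs(overlaps[0]))
    print("final   |overlap|:", np.abs(overlaps[-1]))
```

The per-unit overlaps recorded here, together with the weight norms, play the role of the low-dimensional sufficient statistics mentioned in the abstract. Under these assumptions one would expect the overlaps to linger near zero before growing in magnitude, with the length of that plateau depending on `d` and `lr`; the sign of the final overlap is arbitrary because the target is even in $w^\star \cdot x$.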
