Convergence of Shallow ReLU Networks on Weakly Interacting Data
Léo Dana, Francis Bach, Loucas Pillaud-Vivien
TL;DR
This work analyzes the training dynamics of a one-hidden-layer ReLU network trained by gradient flow on $n$ data points in a high-dimensional setting where the inputs are weakly correlated. It establishes that a width of order $p\sim\log(n)$ suffices for global convergence to interpolation with high probability, using a local Polyak-Łojasiewicz (PL) framework and high-dimensional concentration. For orthogonal data, it refines the convergence rate, showing that the exponential rate lies between the orders $1/n$ and $1/\sqrt{n}$, and reveals a phase-transition phenomenon in the PL curvature during training. Complementary experiments corroborate the $\mu(t)$ scaling and the threshold behaviors, with implications for the design of shallow networks in high-dimensional regimes and for future extensions to broader data regimes and deeper models.
Abstract
We analyse the convergence of one-hidden-layer ReLU networks trained by gradient flow on $n$ data points. Our main contribution leverages the high dimensionality of the ambient space, which implies low correlation of the input samples, to demonstrate that a network of width of order $\log(n)$ suffices for global convergence with high probability. Our analysis uses a Polyak-Łojasiewicz viewpoint along the gradient-flow trajectory, which provides exponential convergence at a rate of order $\frac{1}{n}$. When the data are exactly orthogonal, we give further refined characterizations of the convergence speed, proving that its asymptotic behavior lies between the orders $\frac{1}{n}$ and $\frac{1}{\sqrt{n}}$, and exhibiting a phase-transition phenomenon in the convergence rate, during which it evolves from the lower bound to the upper bound within a relative time of order $\frac{1}{\log(n)}$.
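As a rough illustration of the setting described above, the following NumPy sketch trains a one-hidden-layer ReLU network on exactly orthogonal inputs with a width proportional to $\log(n)$. It is not the authors' code, and several choices are assumptions made only for the example: the function name `train_shallow_relu`, the width constant 4, the Gaussian and random-sign initialisation, the $\pm 1$ labels, the loss normalisation $\frac{1}{2n}\sum_i (f(x_i)-y_i)^2$, and the use of small-step gradient descent as a stand-in for gradient flow.

```python
import numpy as np

def train_shallow_relu(n=8, d=16, width=None, lr=0.05, steps=2000, seed=0):
    """Gradient descent on f(x) = sum_j a_j * relu(<w_j, x>) with orthonormal inputs.

    Returns the array of training losses, one value per step.
    All hyperparameters here are illustrative assumptions.
    """
    assert d >= n, "need d >= n to build n orthonormal inputs"
    rng = np.random.default_rng(seed)

    X = np.eye(d)[:n]                        # n exactly orthogonal inputs (rows)
    y = rng.choice([-1.0, 1.0], size=n)      # arbitrary +/-1 labels (assumption)

    # Width of order log(n); the constant 4 is a hypothetical choice.
    p = int(np.ceil(4 * np.log(n))) if width is None else width

    W = rng.normal(size=(p, d)) / np.sqrt(d)            # hidden-layer weights
    a = rng.choice([-1.0, 1.0], size=p) / np.sqrt(p)    # output weights

    losses = np.empty(steps)
    for t in range(steps):
        pre = X @ W.T                        # pre-activations, shape (n, p)
        h = np.maximum(pre, 0.0)             # ReLU features
        r = h @ a - y                        # residuals, shape (n,)
        losses[t] = 0.5 * np.mean(r ** 2)

        # Gradients of the loss (1 / 2n) * sum_i r_i^2.
        grad_a = h.T @ r / n
        grad_W = (r[:, None] * (pre > 0) * a[None, :]).T @ X / n

        a -= lr * grad_a
        W -= lr * grad_W
    return losses

if __name__ == "__main__":
    losses = train_shallow_relu()
    print("initial loss:", losses[0], "final loss:", losses[-1])
```

Plotting `np.log(losses)` against the step index gives an empirical view of the decay rate; how closely this discretised run tracks the gradient-flow rates stated in the abstract depends on the step size and initialisation chosen here.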
