Convergence of Shallow ReLU Networks on Weakly Interacting Data

Léo Dana, Francis Bach, Loucas Pillaud-Vivien

TL;DR

This work analyzes the training dynamics of a one-hidden-layer ReLU network trained by gradient flow on $n$ data points in a high-dimensional setting with weak input correlations. It establishes that a width $p$ of order $\log(n)$ suffices for global convergence to interpolation with high probability, leveraging a local Polyak-Łojasiewicz (PL) framework and high-dimensional concentration. For orthogonal data, it refines the convergence rate, showing that the exponential rate lies between the orders $1/n$ and $1/\sqrt{n}$, and reveals a phase-transition phenomenon in the PL curvature $\mu(t)$ during training. Complementary experiments corroborate the $\mu(t)$ scaling and the threshold behaviors, with practical implications for designing shallow networks in high-dimensional regimes and for future extensions to broader data regimes and deeper models.
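The "weak input correlations" regime can be illustrated numerically. The following is a minimal sketch, assuming inputs drawn uniformly on the unit sphere (normalized Gaussian vectors); this distribution, the function name `max_abs_correlation`, and the chosen sizes are illustrative assumptions, not taken from the paper. It reports the largest pairwise inner product among $n$ samples, which shrinks as the dimension $d$ grows.

```python
import numpy as np

def max_abs_correlation(n, d, seed=0):
    """Largest |<x_i, x_j>|, i != j, among n random unit vectors in R^d.

    Assumption: inputs are normalized Gaussian vectors (uniform on the sphere).
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # project rows onto the unit sphere
    G = X @ X.T                                    # Gram matrix of pairwise inner products
    np.fill_diagonal(G, 0.0)                       # ignore the diagonal <x_i, x_i> = 1
    return float(np.max(np.abs(G)))

if __name__ == "__main__":
    # Correlations decay roughly like 1/sqrt(d), up to a logarithmic factor in n.
    for d in (50, 500, 5000):
        print(d, max_abs_correlation(n=20, d=d))
```

For fixed $n$, the printed values decrease as $d$ increases, which is the sense in which high dimension makes the samples nearly orthogonal.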

Abstract

We analyse the convergence of one-hidden-layer ReLU networks trained by gradient flow on $n$ data points. Our main contribution leverages the high dimensionality of the ambient space, which implies low correlation of the input samples, to demonstrate that a network with width of order $\log(n)$ neurons suffices for global convergence with high probability. Our analysis uses a Polyak-Łojasiewicz viewpoint along the gradient-flow trajectory, which yields exponential convergence at a rate of order $\frac{1}{n}$. When the data are exactly orthogonal, we give further refined characterizations of the convergence speed, proving that its asymptotic behavior lies between the orders $\frac{1}{n}$ and $\frac{1}{\sqrt{n}}$, and exhibiting a phase-transition phenomenon in the convergence rate, during which the rate moves from the lower bound to the upper bound within a relative time of order $\frac{1}{\log(n)}$.
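The link between a Polyak-Łojasiewicz inequality along the trajectory and an exponential rate can be sketched with the standard argument below. The notation is assumed for illustration: $L$ is the training loss, $\theta(t)$ the parameters under gradient flow, and $\mu(t) > 0$ the local PL curvature; the exact normalization constants used in the paper may differ.

```latex
% Gradient flow: \dot\theta(t) = -\nabla L(\theta(t)).
% Local PL inequality along the trajectory (assumed form):
%   \|\nabla L(\theta(t))\|^2 \ge \mu(t)\, L(\theta(t)).
\begin{align*}
\frac{\mathrm{d}}{\mathrm{d}t} L(\theta(t))
  &= \big\langle \nabla L(\theta(t)),\, \dot\theta(t) \big\rangle
   = -\|\nabla L(\theta(t))\|^2
   \le -\mu(t)\, L(\theta(t)), \\
L(\theta(t))
  &\le L(\theta(0)) \exp\!\Big(-\int_0^t \mu(s)\,\mathrm{d}s\Big)
  \qquad \text{(Gr\"onwall's inequality)}.
\end{align*}
```

In particular, a lower bound $\mu(t) \geq c/n$ along the whole trajectory would give $L(\theta(t)) \leq L(\theta(0))\, e^{-ct/n}$, which is how a rate of order $\frac{1}{n}$ arises in this type of argument.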

Paper Structure

This paper contains 43 sections, 14 theorems, 99 equations, 6 figures.

Key Result

Lemma 1

For all $j \in \llbracket 1,p \rrbracket$ and all $t \geq 0$, $|a_j(t)|^2 - \|w_j(t)\|^2 = |a_j(0)|^2 - \|w_j(0)\|^2$. Thus, if $|a_j(0)| \geq \|w_j(0)\|$, then $a_j(t)$ maintains its sign and $|a_j(t)| \geq \|w_j(t)\|$.
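The conservation law in Lemma 1 can be checked numerically. The sketch below assumes a network $f(x) = \sum_j a_j \max(\langle w_j, x\rangle, 0)$ with squared loss $\frac{1}{2n}\sum_i (f(x_i) - y_i)^2$, and approximates gradient flow by gradient descent with a small step; the parametrization, data, initialization, and the name `balancedness_drift` are illustrative assumptions rather than the paper's setup. With a discrete step the quantity $a_j^2 - \|w_j\|^2$ is only conserved up to an error of second order in the step size.

```python
import numpy as np

def balancedness_drift(n=8, d=32, p=4, lr=1e-3, steps=2000, seed=0):
    """Track a_j^2 - ||w_j||^2 under small-step gradient descent (Euler proxy for gradient flow)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)       # unit-norm inputs (assumption)
    y = rng.choice([-1.0, 1.0], size=n)                 # arbitrary +/-1 labels (assumption)
    W = rng.standard_normal((p, d)) / np.sqrt(d)        # hidden weights w_j as rows
    # Initialize so that |a_j(0)| >= ||w_j(0)||, the hypothesis of the lemma.
    a = rng.choice([-1.0, 1.0], size=p) * (np.linalg.norm(W, axis=1) + 0.5)

    inv0 = a**2 - np.sum(W**2, axis=1)                  # invariant at t = 0
    sign0 = np.sign(a)

    for _ in range(steps):
        pre = X @ W.T                                   # (n, p) pre-activations <w_j, x_i>
        act = np.maximum(pre, 0.0)                      # ReLU
        res = act @ a - y                               # residuals f(x_i) - y_i
        grad_a = act.T @ res / n                        # dL/da_j
        grad_W = (res[:, None] * (pre > 0) * a[None, :]).T @ X / n  # dL/dw_j
        a = a - lr * grad_a
        W = W - lr * grad_W

    inv = a**2 - np.sum(W**2, axis=1)
    drift = float(np.max(np.abs(inv - inv0)))
    same_sign = bool(np.all(np.sign(a) == sign0))
    dominated = bool(np.all(np.abs(a) >= np.linalg.norm(W, axis=1)))
    return drift, same_sign, dominated

if __name__ == "__main__":
    print(balancedness_drift())
```

The invariance follows from the identity $a_j\, \partial_{a_j} L = \langle w_j, \nabla_{w_j} L\rangle$, which holds because each neuron is positively homogeneous in $(a_j, w_j)$; reducing `lr` should shrink the reported drift accordingly.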

Figures (6)

  • Figure 1: Example of group initialization with $p_n = 3$ neurons and $k_n=2$ examples per neuron, for $n=6$ total examples. Group initialization allows each group to be treated independently of the others, and thus reduces the problem to that of a 1-neuron network.
  • Figure 2: Simulation of the loss trajectory of a network with 2 neurons and group initialization, each activated separately on half the data points. $L_{n}$ is the rescaled loss for $n$ examples, and $L_{\infty}$ is its limit as $n$ goes to infinity. We can see two phase transitions in the very high-dimensional regime.
  • Figure 3: Left: Probability that a network trained on $n$ data points converges to 0 loss. We observe a transition at $n=3000$, from likely to unlikely convergence. Right: Loss at convergence normalized by the loss at initialization. For $n\geq 3000$, the loss increases to $0.6\%$, which is equivalent to fitting all but one example.
  • Figure 4: This graph shows the scaling law of the convergence threshold for a fixed number of neurons. It suggests that the scaling is linear in $d$: $N(d,p) = C(p)d$.
  • Figure 5: This graph shows the scaling law of the convergence threshold for a fixed number of neurons. It suggests that the scaling is not linear in $d$, but it is hard to distinguish sub-linear polynomial growth from logarithmic growth.
  • ...and 1 more figure

Theorems & Definitions (29)

  • Lemma 1
  • Lemma 2
  • Theorem 1
  • Lemma 3
  • Corollary 1
  • Proposition 1
  • Proposition 2
  • Conjecture 1
  • Lemma 4
  • Theorem 2
  • ...and 19 more