Towards moderate overparameterization: global convergence guarantees for training shallow neural networks

Samet Oymak, Mahdi Soltanolkotabi

TL;DR

The paper investigates global convergence of first-order methods for shallow neural networks under moderate overparameterization. For random initializations, it establishes geometric convergence to a zero-training-error solution once the number of parameters kd (k hidden units, d input dimension) is large enough relative to the training-set size n, with separate bounds for smooth activations and ReLUs and an extension to SGD with comparable rates. The authors tie the optimization guarantees to spectral properties of the Jacobian and of a neural-network covariance matrix, and show that a simple random-feature perspective suffices to achieve interpolation when k is sufficiently large relative to n. Numerical experiments corroborate phase transitions near the kd ≈ n threshold, highlighting practical relevance beyond width-based analyses. Overall, the work narrows the gap between theory and practice by proving global convergence under much milder overparameterization than prior results require, and by offering a framework extendable to deeper architectures and other losses.

Abstract

Many modern neural network architectures are trained in an overparameterized regime where the number of model parameters exceeds the size of the training dataset. Sufficiently overparameterized neural network architectures in principle have the capacity to fit any set of labels, including random noise. However, given the highly nonconvex nature of the training landscape, it is not clear what level and kind of overparameterization is required for first-order methods to converge to a global optimum that perfectly interpolates any labels. A number of recent theoretical works have shown that for very wide neural networks, where the number of hidden units is polynomially large in the size of the training data, gradient descent starting from a random initialization does indeed converge to a global optimum. However, in practice much more moderate levels of overparameterization seem to be sufficient, and in many cases overparameterized models seem to perfectly interpolate the training data as soon as the number of parameters exceeds the size of the training data by a constant factor. Thus there is a huge gap between the existing theoretical literature and practical experiments. In this paper we take a step towards closing this gap. Focusing on shallow neural nets and smooth activations, we show that (stochastic) gradient descent, when initialized at random, converges at a geometric rate to a nearby global optimum as soon as the square root of the number of network parameters exceeds the size of the training data. Our results also benefit from a fast convergence rate and continue to hold for non-differentiable activations such as Rectified Linear Units (ReLUs).
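
To make the scaling in the abstract concrete, here is a small worked comparison. The symbols $k$ (hidden units), $d$ (input dimension) and $n$ (training-set size) follow the figure captions; counting the parameters as $kd$ and dropping constants and logarithmic factors are simplifying assumptions, and the sample value $n=10^3$ is purely illustrative.

$$
\sqrt{kd}\;\gtrsim\; n \quad\Longleftrightarrow\quad kd\;\gtrsim\; n^2, \qquad \text{e.g. } n=10^3 \;\Rightarrow\; kd\;\gtrsim\; 10^6 .
$$

By contrast, the regime the abstract describes as sufficient in practice, where the parameter count exceeds the training-set size by a constant factor $C$, would read $kd \gtrsim C\,n$, i.e. on the order of $10^3$ parameters times a constant in this example.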

Paper Structure

This paper contains 36 sections, 31 theorems, 177 equations, 2 figures.

Key Result

Theorem 2.1

Consider a data set of input/label pairs $\bm{x}_i\in\mathbb{R}^d$ and $y_i\in\mathbb{R}$ for $i=1,2,\ldots,n$ aggregated as rows/entries of a data matrix $\bm{X}\in\mathbb{R}^{n\times d}$ and a label vector $\bm{y}\in\mathbb{R}^n$. Without loss of generality we assume the dataset is normalized so t[…] and $c$ is a fixed numerical constant, then with probability at least $1-\frac{1}{n}-e^{-\delta^2\cdots}$ […]

Figures (2)

  • Figure 1: Illustration of a one-hidden-layer neural net with $d$ inputs, $k$ hidden units and a single output.
  • Figure 2: Phase transitions for overparameterization. These diagrams show the empirical probability that gradient descent from a random initialization successfully fits $n$ random labels $\bm{y}\in\mathbb{R}^n$ when a one-hidden-layer neural network is used. Here, $d$ is the input dimension, $k$ the number of hidden units, and $n$ the size of the training data. The colormap ranges from red to blue, where red represents certain success and blue represents certain failure. The solid white line highlights $n=kd$, i.e., the point where the size of the training data equals the number of parameters.
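
The kind of experiment described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the softplus activation, the fixed random-sign output weights `v`, the unit-norm inputs, the step size, and the default sizes `n=10`, `d=20`, `k=20` are all choices made here for the example, and `train_shallow_net` is a hypothetical helper name.

```python
import numpy as np


def softplus(z):
    # Smooth activation phi(z) = log(1 + exp(z)), computed stably.
    return np.logaddexp(0.0, z)


def sigmoid(z):
    # Derivative of softplus, written via tanh to avoid overflow.
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def train_shallow_net(n=10, d=20, k=20, steps=2000, seed=0):
    """Gradient descent on a one-hidden-layer net fitting random labels.

    Model (an assumption of this sketch): f(x) = v^T softplus(W x),
    where only the hidden weights W (k x d) are trained.
    Returns the least-squares loss recorded before each step.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)      # unit-norm inputs
    y = rng.standard_normal(n)                         # random labels
    W = rng.standard_normal((k, d))                    # random initialization
    v = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)   # fixed output layer

    # Step size chosen from the spectral norm of X (a heuristic assumption).
    lr = 1.0 / np.linalg.norm(X, 2) ** 2

    losses = []
    for _ in range(steps):
        Z = X @ W.T                       # pre-activations, shape (n, k)
        r = softplus(Z) @ v - y           # residuals, shape (n,)
        losses.append(0.5 * r @ r)
        # dL/dW[l, :] = sum_i r_i * v_l * phi'(Z[i, l]) * X[i, :]
        G = (r[:, None] * sigmoid(Z) * v).T @ X
        W -= lr * G
    return np.array(losses)


if __name__ == "__main__":
    losses = train_shallow_net()
    print(f"initial loss {losses[0]:.3e}, final loss {losses[-1]:.3e}")
```

With the defaults the network has $kd = 400$ trainable parameters against $n = 10$ labels, so it sits well inside the overparameterized side of the $n = kd$ line. Repeating the run over a grid of $(n, k, d)$ values and several seeds, and recording how often the final loss falls below a tolerance, would give an empirical success probability of the kind the caption describes.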

Theorems & Definitions (34)

  • Theorem 2.1
  • Corollary 2.2
  • Theorem 2.3
  • Corollary 2.4
  • Theorem 2.5
  • Theorem 2.6
  • Definition 3.1: Output feature covariance and eigenvalue
  • Theorem 3.2
  • Definition 6.1: Neural network covariance matrix and eigenvalue
  • Theorem 6.2: Meta-theorem for smooth activations
  • ...and 24 more