Global Convergence of SGD On Two Layer Neural Nets

Pulkit Gopalani; Anirbit Mukherjee

TL;DR

The paper studies convergence of SGD on shallow, depth-2 neural nets under Frobenius-norm regularization. It shows that the regularized $\ell_2$ loss can be made a Villani function, which enables global convergence guarantees without constraints on width or data. Using the SGD–SDE framework, the authors derive nonasymptotic bounds for SGD on sigmoid/tanh nets, with a regularization threshold $\lambda_c$ that is independent of width; they also prove exponential convergence for continuous-time SGD with SoftPlus activations. A central contribution is handling finite-width, non-kernel regimes by leveraging Villani-function dynamics rather than NTK or mean-field limits. The theory is complemented by experiments on the effect of regularization across widths, offering a pathway to more general provable training results beyond large-width assumptions.

Abstract

In this note, we consider appropriately regularized $\ell_2$-empirical risk of depth-$2$ nets with any number of gates and show bounds on how the empirical loss evolves for SGD iterates on it, for arbitrary data and provided the activation is adequately smooth and bounded, like sigmoid and tanh. This in turn leads to a proof of global convergence of SGD for a special class of initializations. We also prove an exponentially fast convergence rate for continuous-time SGD that also applies to smooth unbounded activations like SoftPlus. Our key idea is to show the existence of Frobenius-norm regularized loss functions on constant-sized neural nets which are "Villani functions", which lets us build on recent progress in analyzing SGD on such objectives. Most critically, the amount of regularization required for our analysis is independent of the size of the net.
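To make the objective in the abstract concrete, the following is a minimal NumPy sketch of a Frobenius-norm regularized $\ell_2$ empirical risk on a depth-2 sigmoid net, together with constant step-size mini-batch SGD on it. It is an illustration under stated assumptions, not the authors' code: the excerpt above does not spell out the parameterization, so the sketch assumes a net of the form $a^\top \sigma(W x)$ with fixed outer weights `a`, a trainable inner matrix `W`, and a penalty $\frac{\lambda}{2}\|W\|_F^2$. All function names (`regularized_loss`, `minibatch_grad`, `sgd`) are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Smooth, bounded activation of the kind the abstract refers to."""
    return 1.0 / (1.0 + np.exp(-z))


def regularized_loss(W, a, X, y, lam):
    """(1/2n) * sum_i (y_i - a^T sigmoid(W x_i))^2 + (lam/2) * ||W||_F^2.

    Assumed form: outer weights `a` are fixed, only `W` (shape p x d) is trained.
    """
    preds = sigmoid(X @ W.T) @ a                 # shape (n,)
    return 0.5 * np.mean((y - preds) ** 2) + 0.5 * lam * np.sum(W ** 2)


def minibatch_grad(W, a, Xb, yb, lam):
    """Gradient of the regularized loss w.r.t. W on a mini-batch (Xb, yb)."""
    H = sigmoid(Xb @ W.T)                        # hidden activations, (b, p)
    resid = H @ a - yb                           # prediction error, (b,)
    # Chain rule: resid_i * a_j * sigmoid'(w_j . x_i) * x_i, averaged over i.
    M = resid[:, None] * H * (1.0 - H) * a[None, :]
    return M.T @ Xb / Xb.shape[0] + lam * W      # (p, d)


def sgd(W0, a, X, y, lam, eta, steps, batch, seed=0):
    """Constant step-size SGD; returns final W and the full-data loss trace."""
    rng = np.random.default_rng(seed)
    W = W0.copy()
    losses = [regularized_loss(W, a, X, y, lam)]
    for _ in range(steps):
        idx = rng.choice(X.shape[0], size=batch, replace=False)
        W -= eta * minibatch_grad(W, a, X[idx], y[idx], lam)
        losses.append(regularized_loss(W, a, X, y, lam))
    return W, losses
```

Calling `sgd(W0, a, X, y, lam, eta, steps, batch)` with a random `W0` and plotting the returned loss trace for a few values of `lam` gives a quick way to see the regularization effect the paper studies. The value of the threshold $\lambda_c$ is not stated in this excerpt, so any `lam` used here is an arbitrary choice rather than the one the theory requires.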

Paper Structure (17 sections, 7 theorems, 46 equations, 6 figures)

Introduction
Organization
Related Work
Review of the NTK Approach To Provable Neural Training
Review of the Mean-Field Approach To Provable Neural Net Training
Need And Attempts To Go Beyond Large Width Limits of Nets
Related Work on Provable Training of Neural Networks Using Regularization
Setup and Main Results
Global Convergence of Continuous Time SGD on Nets with SoftPlus Gates
An Experimental Study of the Effect of Regularization at Various Widths
Overview of weijie_sde
Proof of Theorem \ref{lem:error_bound}
Ablation Study with Noisy Labels
Conclusion
Towards Establishing the Villani condition for the Empirical Loss on Nets
...and 2 more sections

Key Result

Theorem 1.1

If the initial weights are sampled from an appropriate class of distributions (dependent on the choice of accuracy parameter $\epsilon$), then for nets with a single layer of sigmoid or tanh gates, for arbitrary data and size of the net, SGD on $\ell_2$-losses on such architectures regularized w

Figures (6)

Figure 1: Best test loss across a range of $(\lambda,p = {\rm width})$ for $({\mathbf{x}},y)$ data labeled as,
Figure 2: Best test loss across a range of $(\lambda,{\rm width} = p)$ for realizable data $({\mathbf{x}},y)$ sampled as,
Figure 3: Regression, $(p, \eta, \lambda) = (10, 1\text{e-}2, 0.013)$
Figure 4: Regression, $(p, \eta, \lambda) = (10, 1\text{e-}2, 0.13)$
Figure 5: Regression, $(p, \eta, \lambda) = (50, 5\text{e-}3, 0.013)$
...and 1 more figures

Theorems & Definitions (14)

Theorem 1.1: Informal Statement of Corollary \ref{thm:sgd-sig}
Lemma 1.2
Definition 1: Constant Step-Size SGD On Depth-2 Nets
Definition 2: Properties of the Activation $\sigma$
Lemma 3.1
Theorem 3.2: Global Convergence of SGD on Sigmoid and Tanh Neural Nets of $2$ Layers for Any Width and Data, Arbitrary Initialization.
Corollary 3.3: Global Convergence of SGD on Sigmoid and Tanh Neural Nets of $2$ Layers for Any Width and Data
Definition 3: SoftPlus activation
Remark
Theorem 3.4: Convergence To Global Minima of Continuous Time SGD on Depth-$2$ SoftPlus Nets
...and 4 more
