Global Convergence of SGD For Logistic Loss on Two Layer Neural Nets

Pulkit Gopalani; Samyak Jha; Anirbit Mukherjee

Abstract

In this note, we demonstrate a first-of-its-kind provable convergence of SGD to the global minima of appropriately regularized logistic empirical risk of depth $2$ nets -- for arbitrary data and with any number of gates with adequately smooth and bounded activations like sigmoid and tanh. We also prove an exponentially fast convergence rate for continuous time SGD that also applies to smooth unbounded activations like SoftPlus. Our key idea is to show the existence of Frobenius norm regularized logistic loss functions on constant-sized neural nets which are "Villani functions", which lets us build on recent progress in analyzing SGD on such objectives.
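To make the objective in the abstract concrete, the following is a minimal sketch of constant step-size SGD on a Frobenius norm regularized logistic loss for a depth-2 net with sigmoid gates. It is an illustration under stated assumptions, not the authors' implementation: the outer weights `a` are assumed fixed and only the inner matrix `W` is trained, labels are assumed to lie in $\{-1, +1\}$, and all function and parameter names (`net`, `reg_logistic_loss`, `grad`, `sgd`, `lam`, `step`) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def net(W, a, X):
    # Depth-2 net f(x) = a^T sigmoid(W x); X is (n, d), W is (p, d), a is (p,).
    return sigmoid(X @ W.T) @ a

def reg_logistic_loss(W, a, X, y, lam):
    # Mean logistic loss log(1 + exp(-y f(x))) plus (lam / 2) * ||W||_F^2.
    margins = y * net(W, a, X)
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * lam * np.sum(W ** 2)

def grad(W, a, X, y, lam):
    # Gradient of reg_logistic_loss with respect to W (a is assumed fixed).
    H = sigmoid(X @ W.T)                      # gate outputs, shape (n, p)
    margins = y * (H @ a)                     # shape (n,)
    dl_df = -y * sigmoid(-margins)            # derivative of the loss in f
    M = dl_df[:, None] * (H * (1.0 - H)) * a[None, :]
    return M.T @ X / X.shape[0] + lam * W     # shape (p, d)

def sgd(W0, a, X, y, lam, step, n_iters, batch, seed=0):
    # Constant step-size SGD with mini-batches sampled without replacement.
    rng = np.random.default_rng(seed)
    W = W0.copy()
    for _ in range(n_iters):
        idx = rng.choice(X.shape[0], size=batch, replace=False)
        W -= step * grad(W, a, X[idx], y[idx], lam)
    return W
```

A call such as `sgd(W0, a, X, y, lam=0.03125, step=0.05, n_iters=200, batch=32)` runs the iteration on synthetic data; the value $0.03125$ matches the regularizer quoted in the figure captions, while the remaining settings are arbitrary choices for illustration and are not the step sizes or thresholds required by the paper's theorems.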

Paper Structure (20 sections, 14 theorems, 65 equations, 3 figures)

Introduction
Organization
Related Work
Review of the NTK Approach To Provable Neural Training
Review of the Mean-Field Approach To Provable Neural Net Training
Need And Attempts To Go Beyond Large Width Limits of Nets
Related Work on Provable Training of Neural Networks Using Regularization
Setup and Main Results
Global Convergence of Continuous Time SGD on Nets with SoftPlus Gates
An Experimental Demonstration of the Maintenance of Classification Accuracy At Various Regularizations at Different Widths
Overview of weijie_sde
Proof of Theorem \ref{thm:error_bound}
Conclusion
Towards Establishing the Villani condition for the Empirical Logistic Loss
Bounding the Gradient Lipschitzness Coefficient of the Empirical Logistic Loss
...and 5 more sections

Key Result

Theorem 1.1

If the initial weights are sampled from an appropriate class of distributions, then for nets with a single layer of sigmoid or tanh gates -- for arbitrary data and size of the net -- SGD on appropriately regularized logistic loss, while using constant steps of size ${\mathcal{O}}(\epsilon)$, will co

Figures (3)

Figure 1: Test Accuracy across various widths $p$ and regularizer $\lambda$
Figure 2: Batch Size = 3000, $\lambda = \lambda_c = 0.03125$ and the net being trained has $12$ sigmoid gates
Figure 3: Batch Size = 3000, $\lambda = \lambda_c = 0.03125$ and the net being trained has $12$ sigmoid gates.

Theorems & Definitions (22)

Theorem 1.1: Informal Statement of Lemma \ref{thm:sgd-sig-bce}
Lemma 1.2
Definition 1: Constant Step-Size SGD On Depth-2 Nets
Definition 2: Properties of the Activation $\sigma$
Lemma 3.1: for Classification with Logistic Loss
Theorem 3.2: Bounds on Error for Arbitrary Initialization.
Lemma 3.3: Global Convergence of SGD on Sigmoid and Tanh Neural Nets of 2 Layers for Any Width and Any Data - for Binary Classification With Logistic Loss
Definition 3: SoftPlus activation
Remark
Lemma 3.4
...and 12 more
