Global Convergence of SGD For Logistic Loss on Two Layer Neural Nets

Pulkit Gopalani, Samyak Jha, Anirbit Mukherjee

Abstract

In this note, we demonstrate a first-of-its-kind provable convergence of SGD to the global minima of appropriately regularized logistic empirical risk of depth $2$ nets -- for arbitrary data and with any number of gates with adequately smooth and bounded activations like sigmoid and tanh. We also prove an exponentially fast convergence rate for continuous time SGD that also applies to smooth unbounded activations like SoftPlus. Our key idea is to show the existence of Frobenius norm regularized logistic loss functions on constant-sized neural nets which are "Villani functions", which lets us build on recent progress in analyzing SGD on such objectives.
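To make the objective concrete, here is a minimal sketch of constant step-size SGD on a Frobenius norm regularized logistic loss for a depth-2 sigmoid net. Several details are assumptions made for illustration and are not taken from the text above: labels are in $\{-1, +1\}$, the outer weights `a` are held fixed while only the inner weight matrix `W` is trained, the regularizer is written as $(\lambda/2)\|W\|_F^2$, and the toy data, width, and step size `eta` are hypothetical. The value $\lambda = 0.03125$ is borrowed from the figure captions purely as an example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(t):
    # Stable evaluation of log(1 + exp(t)).
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def net(W, a, x):
    # Depth-2 net: f(x) = sum_k a[k] * sigmoid(<W[k], x>).
    return sum(a_k * sigmoid(sum(w * xi for w, xi in zip(w_k, x)))
               for a_k, w_k in zip(a, W))


def regularized_loss(W, a, data, lam):
    # Logistic empirical risk plus (lam / 2) * ||W||_F^2 (assumed form).
    emp = sum(softplus(-y * net(W, a, x)) for x, y in data) / len(data)
    frob_sq = sum(w * w for row in W for w in row)
    return emp + 0.5 * lam * frob_sq


def sgd_step(W, a, batch, lam, eta):
    # One constant step-size update on W using the mini-batch gradient.
    p, d = len(W), len(W[0])
    grad = [[lam * W[k][j] for j in range(d)] for k in range(p)]
    for x, y in batch:
        pre = [sum(w * xi for w, xi in zip(W[k], x)) for k in range(p)]
        f = sum(a[k] * sigmoid(pre[k]) for k in range(p))
        dl = -y * sigmoid(-y * f)  # derivative of log(1 + exp(-y f)) in f
        for k in range(p):
            s = sigmoid(pre[k])
            coeff = dl * a[k] * s * (1.0 - s) / len(batch)
            for j in range(d):
                grad[k][j] += coeff * x[j]
    return [[W[k][j] - eta * grad[k][j] for j in range(d)] for k in range(p)]


def run_demo(steps=200, seed=0):
    # Hypothetical toy problem: 4 points in R^2, 3 sigmoid gates.
    rng = random.Random(seed)
    data = [([1.0, 2.0], 1), ([2.0, 1.0], 1),
            ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]
    a = [1.0, -1.0, 1.0]
    W = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(3)]
    lam, eta = 0.03125, 0.1
    start = regularized_loss(W, a, data, lam)
    for _ in range(steps):
        W = sgd_step(W, a, rng.sample(data, 2), lam, eta)
    return start, regularized_loss(W, a, data, lam)
```

Calling `run_demo()` returns the regularized loss before and after training, which can be compared to see how the iterates behave on the toy data. The sketch only illustrates the shape of the objective and the update; it does not check the conditions on $\lambda$ or on the initialization that the convergence guarantee relies on.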

Paper Structure

This paper contains 20 sections, 14 theorems, 65 equations, 3 figures.

Key Result

Theorem 1.1

If the initial weights are sampled from an appropriate class of distributions, then for nets with a single layer of sigmoid or tanh gates -- for arbitrary data and size of the net -- SGD on appropriately regularized logistic loss, while using constant steps of size ${\mathcal{O}}(\epsilon)$, will co

Figures (3)

  • Figure 1: Test Accuracy across various widths $p$ and regularizer $\lambda$
  • Figure 2: Batch Size = 3000, $\lambda = \lambda_c = 0.03125$ and the net being trained has $12$ sigmoid gates
  • Figure 3: Batch Size = 3000, $\lambda = \lambda_c = 0.03125$ and the net being trained has $12$ sigmoid gates.

Theorems & Definitions (22)

  • Theorem 1.1: Informal Statement of Lemma \ref{thm:sgd-sig-bce}
  • Lemma 1.2
  • Definition 1: Constant Step-Size SGD On Depth-2 Nets
  • Definition 2: Properties of the Activation $\sigma$
  • Lemma 3.1: for Classification with Logistic Loss
  • Theorem 3.2: Bounds on Error for Arbitrary Initialization.
  • Lemma 3.3: Global Convergence of SGD on Sigmoid and Tanh Neural Nets of 2 Layers for Any Width and Any Data - for Binary Classification With Logistic Loss
  • Definition 3: SoftPlus activation
  • Remark
  • Lemma 3.4
  • ...and 12 more