Convergence of continuous-time stochastic gradient descent with applications to deep neural networks

Gabor Lugosi, Eulalia Nualart

TL;DR

The paper investigates the convergence of a continuous-time stochastic gradient descent model, expressed as $\mathrm{d}w_t=-\nabla f(w_t)\,dt+\sqrt{\eta}\,\sigma(w_t)\,dB_t$, to global minima of the population loss $f(w)=\mathbb{E}[\ell(w,Z)]$ under stochastic noise. Building on Chatterjee's deterministic criteria, it develops an Itô-calculus framework with stopping times and Lyapunov-type quantities, yielding explicit conditions on $f$ and $\sigma$ (via $A_{\min}, G_{\max}, B_{\max}, \theta$) under which convergence occurs with positive probability and at an exponential rate when initialized near a global minimum. The results are specialized to overparameterized neural networks, showing that with smooth activations and bounded inputs one can satisfy the required inequalities, and that high-probability convergence to the global minimum set $\mathcal{S}$ can be achieved with sufficiently large final-layer weights and small noise. The work thus provides a rigorous link between continuous-time SGD dynamics, PL-type behavior, and NTK-inspired conditions in the population-risk setting, offering theoretical justification for efficient learning in deep networks under stochastic optimization.
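
To make the dynamics above concrete, here is a minimal Euler-Maruyama sketch of $\mathrm{d}w_t=-\nabla f(w_t)\,dt+\sqrt{\eta}\,\sigma(w_t)\,dB_t$. It is an illustration only, not the paper's method: the quadratic loss `f`, the isotropic noise scale `sigma_scale` (chosen so that the noise vanishes where $f=0$), the function name `euler_maruyama`, and all numerical values are assumptions made for this example.

```python
import math
import random


def f(w):
    # Hypothetical toy loss f(w) = 0.5 * ||w||^2, with global minimum value 0 at w = 0.
    return 0.5 * sum(x * x for x in w)


def grad_f(w):
    # Gradient of the toy loss above: grad f(w) = w.
    return list(w)


def sigma_scale(w):
    # Hypothetical isotropic noise sigma(w) = sqrt(f(w)) * I.
    # This choice is an assumption; it makes the noise vanish at the minimum.
    return math.sqrt(f(w))


def euler_maruyama(w0, eta, dt, n_steps, seed=0):
    """Discretize dw = -grad f(w) dt + sqrt(eta) * sigma(w) dB with step dt."""
    rng = random.Random(seed)
    w = list(w0)
    losses = [f(w)]
    for _ in range(n_steps):
        g = grad_f(w)
        s = sigma_scale(w)
        # Each Brownian increment over a step of length dt is N(0, dt).
        w = [
            wi - gi * dt + math.sqrt(eta) * s * rng.gauss(0.0, math.sqrt(dt))
            for wi, gi in zip(w, g)
        ]
        losses.append(f(w))
    return w, losses


if __name__ == "__main__":
    w_final, losses = euler_maruyama([1.0, -0.5], eta=0.01, dt=0.01, n_steps=2000)
    print("initial loss:", losses[0], "final loss:", losses[-1])
```

With `eta=0` the scheme reduces to explicit Euler for plain gradient flow. In this toy setup, a path started exactly at the minimizer never moves, because both the drift and the noise are zero there. This mirrors, in discretized form only, the absorbing behavior stated in Lemma 3.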

Abstract

We study a continuous-time approximation of the stochastic gradient descent process for minimizing the population expected loss in learning problems. The main results establish general sufficient conditions for the convergence, extending the results of Chatterjee (2022) established for (nonstochastic) gradient descent. We show how the main result can be applied to the case of overparametrized neural network training.

Paper Structure

This paper contains 9 sections, 8 theorems, 109 equations.

Key Result

Lemma 3

Consider the SDE (w2), initialized at some $w_0 \in \mathbb{R}^D$, and suppose that Assumptions a1 and a1B hold. If $f(w_t)=0$ for some $t \in [0,T)$, then $T=\infty$ and $w_s = w_t$ for all $s > t$.

Theorems & Definitions (17)

  • Lemma 3
  • Theorem 4: Multi-dimensional Itô formula
  • Remark 5
  • Lemma 6
  • Lemma 7
  • Lemma 8
  • Theorem 9
  • Remark 10
  • Remark 11
  • Remark 12
  • ...and 7 more