Convergence of continuous-time stochastic gradient descent with applications to deep neural networks
Gabor Lugosi, Eulalia Nualart
TL;DR
The paper studies when a continuous-time model of stochastic gradient descent, $\mathrm{d}w_t=-\nabla f(w_t)\,\mathrm{d}t+\sqrt{\eta}\,\sigma(w_t)\,\mathrm{d}B_t$, converges to a global minimum of the population loss $f(w)=\mathbb{E}[\ell(w,Z)]$. Building on Chatterjee's criteria for deterministic gradient descent, it develops an Itô-calculus framework based on stopping times and Lyapunov-type quantities. This yields explicit conditions on $f$ and $\sigma$ (stated via $A_{\min}, G_{\max}, B_{\max}, \theta$) under which, when the process is initialized near a global minimum, convergence occurs with positive probability and at an exponential rate. The results are then specialized to overparameterized neural networks: with smooth activations and bounded inputs the required inequalities can be satisfied, and convergence to the set of global minima $\mathcal{S}$ holds with high probability when the final-layer weights are sufficiently large and the noise is small. The work thus gives a rigorous link between continuous-time SGD dynamics, PL-type behavior, and NTK-inspired conditions in the population-risk setting, supporting the theoretical case for efficient learning in deep networks under stochastic optimization.
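To make the dynamics in the TL;DR concrete, the following is a minimal sketch of an Euler-Maruyama discretization of the stochastic differential equation, not the paper's own construction. The toy loss $f(w)=w^2/2$, the noise coefficient $\sigma(w)=w$ (chosen so that the noise vanishes at the minimizer $w=0$), the function name `euler_maruyama`, and all numerical values are illustrative assumptions. For this choice the equation is a geometric Brownian motion, which decays to zero exponentially fast.

```python
import math
import random


def euler_maruyama(grad_f, sigma, w0, eta, dt, n_steps, seed=0):
    """Simulate dw = -grad_f(w) dt + sqrt(eta) * sigma(w) dB in one dimension.

    Hypothetical helper for illustration; returns the whole discretized path.
    """
    rng = random.Random(seed)
    w = w0
    path = [w]
    for _ in range(n_steps):
        # Brownian increment over a step of length dt, distributed as N(0, dt).
        dB = math.sqrt(dt) * rng.gauss(0.0, 1.0)
        # Gradient drift plus state-dependent noise scaled by sqrt(eta).
        w = w - grad_f(w) * dt + math.sqrt(eta) * sigma(w) * dB
        path.append(w)
    return path


# Toy choices (assumptions, not taken from the paper):
# f(w) = w^2 / 2, so grad f(w) = w, with global minimum at w = 0.
def grad_f(w):
    return w


# Noise coefficient that vanishes at the global minimum.
def sigma(w):
    return w


path = euler_maruyama(grad_f, sigma, w0=1.0, eta=0.1, dt=0.01, n_steps=1000)
print(f"f(w_0) = {0.5 * path[0] ** 2:.3e}, f(w_T) = {0.5 * path[-1] ** 2:.3e}")
```

Setting `eta=0.0` recovers the explicit Euler scheme for the deterministic gradient flow, which makes it easy to compare a noisy path with the nonstochastic case.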
Abstract
We study a continuous-time approximation of the stochastic gradient descent process for minimizing the population expected loss in learning problems. The main results establish general sufficient conditions for convergence, extending results that Chatterjee (2022) established for (nonstochastic) gradient descent. We show how the main result can be applied to the training of overparameterized neural networks.
