Singular-limit analysis of gradient descent with noise injection

Anna Shalova; André Schlichting; Mark Peletier

TL;DR

This work develops a unified, rigorous framework for the limiting dynamics of noisy gradient descent in overparameterized settings where the zero-loss set $\Gamma$ forms a manifold. By embedding the noise in a general loss $\hat{L}$ with zero-noise consistency $\hat{L}(w,0)=L(w)$ and applying Katzenberger’s theorem, it derives a two-regime description: a fast phase toward $\Gamma$, followed by a slow evolution along $\Gamma$ that is either deterministic (non-degenerate case) or stochastic (degenerate case) and is driven by a regularizer $\mathrm{Reg}(w)=\tfrac12\Delta_\eta\hat{L}(w,0)$. The main results yield explicit limiting dynamics: in the non-degenerate case a constrained gradient flow $\partial_t W = -P_{T_W\Gamma}\nabla_w\mathrm{Reg}(W)$ on $\Gamma$, and in the degenerate case a constrained SDE on $\Gamma$, with time scales $1/(\alpha\sigma^2)$ and $1/(\alpha^2\sigma^2)$ respectively. The framework is illustrated through diverse noise mechanisms (DropConnect, Dropout, minibatching, label noise, SGLD, anti-correlated perturbed GD) and shows, for example, that minibatch SGD may have vanishing regularization on both of these time scales, while Dropout/DropConnect induce non-local curvature-based regularization along the zero-loss manifold. These results connect noise structure to implicit regularization and generalization, and offer a precise, generalizable lens for understanding training dynamics in overparameterized models beyond neural networks.
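
The role of the regularizer and of the time scale $1/(\alpha\sigma^2)$ can be motivated by a formal Taylor expansion in the noise. The sketch below is a heuristic, not the paper's proof. It assumes an update of the form $w_{k+1} = w_k - \alpha\nabla_w\hat{L}(w_k,\eta_k)$, centered noise $\eta$ with covariance $\sigma^2 I$, and a smooth $\hat{L}$; these assumptions are made here for illustration only.

```latex
\begin{align*}
\mathbb{E}\,\nabla_w \hat{L}(w,\eta)
  &= \nabla_w \hat{L}(w,0)
   + \nabla_w\bigl(\partial_\eta \hat{L}(w,0)\cdot \mathbb{E}\,\eta\bigr)
   + \tfrac12\,\nabla_w\,\mathbb{E}\bigl[\eta^\top \partial_\eta^2 \hat{L}(w,0)\,\eta\bigr]
   + \text{(higher order in } \eta) \\
  &= \nabla_w L(w) + 0 + \tfrac12\,\sigma^2\,\nabla_w \Delta_\eta \hat{L}(w,0)
   + \text{(higher order in } \eta) \\
  &= \nabla_w L(w) + \sigma^2\,\nabla_w \mathrm{Reg}(w)
   + \text{(higher order in } \eta).
\end{align*}
```

On $\Gamma$ the term $\nabla_w L$ vanishes, so under these assumptions the mean step is approximately $-\alpha\sigma^2\nabla_w\mathrm{Reg}(w)$. One unit of slow time then corresponds to roughly $1/(\alpha\sigma^2)$ iterations, which agrees with the non-degenerate time scale quoted above.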

Abstract

We study the limiting dynamics of a large class of noisy gradient descent systems in the overparameterized regime. In this regime the set of global minimizers of the loss is large, and when initialized in a neighbourhood of this zero-loss set, a noisy gradient descent algorithm slowly evolves along this set. In some cases this slow evolution has been related to better generalisation properties. We characterize this evolution for a broad class of noisy gradient descent systems in the limit of small step size. Our results show that the structure of the noise affects not just the form of the limiting process, but also the time scale at which the evolution takes place. We apply the theory to Dropout, label noise and classical SGD (minibatching) noise, and show that these evolve on two different time scales. Classical SGD even yields a trivial evolution on both time scales, implying that additional noise is required for regularization. The results are inspired by the training of neural networks, but the theorems apply to noisy gradient descent for any loss that has a non-trivial zero-loss set.
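
The slow evolution along the zero-loss set can be reproduced in a small simulation. The sketch below is illustrative only: the toy loss $L(w) = \tfrac12 (w_1 w_2 - 1)^2$, the starting point, and the function names `grad_loss` and `noisy_gd` are assumptions made here and are not taken from the paper. The noise enters as $\hat{L}(w,\eta) = L(w+\eta)$ with independent centered Gaussian coordinates. For this toy loss $\tfrac12\Delta_w L(w) = \tfrac12 |w|^2$, whose minimizer on the branch $w_1 w_2 = 1$, $w_1 > 0$ is $(1,1)$, so the noisy iterates are expected to drift toward that point while plain gradient descent started on the zero-loss set does not move.

```python
import random


def grad_loss(w1, w2):
    """Gradient of the toy loss L(w) = 0.5 * (w1 * w2 - 1)**2 (an assumed example)."""
    f = w1 * w2 - 1.0
    return f * w2, f * w1


def noisy_gd(w0, alpha, sigma, n_steps, seed=0):
    """Iterate w <- w - alpha * grad L(w + eta), with eta ~ N(0, sigma^2 I)."""
    rng = random.Random(seed)
    w1, w2 = w0
    for _ in range(n_steps):
        # Fresh, independent noise at every step; sigma = 0 gives plain GD.
        e1 = rng.gauss(0.0, sigma)
        e2 = rng.gauss(0.0, sigma)
        g1, g2 = grad_loss(w1 + e1, w2 + e2)
        w1, w2 = w1 - alpha * g1, w2 - alpha * g2
    return w1, w2


if __name__ == "__main__":
    start = (2.0, 0.5)  # lies on the zero-loss set w1 * w2 = 1
    print("plain GD :", noisy_gd(start, alpha=0.1, sigma=0.0, n_steps=20_000))
    print("noisy GD :", noisy_gd(start, alpha=0.1, sigma=0.1, n_steps=20_000))
```

With $\alpha = 0.1$ and $\sigma^2 = 0.01$, one unit of slow time corresponds to about $1/(\alpha\sigma^2) = 1000$ iterations, so 20,000 iterations cover about 20 units of slow time in this toy setting.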

Paper Structure (40 sections, 16 theorems, 211 equations, 4 figures, 1 table)

Introduction
Noise injection
The non-degenerate case
The degenerate case
Contributions
Related Work
Convergence results to stationary points.
Noise injection: minibatch SGD.
Noise injection: Dropout.
Other types of noise injection.
Evolution along the zero-loss manifold.
Training and flatness of minima.
Notation and Preliminaries
Notation
Problem Setting
...and 25 more sections

Key Result

Proposition 3.9

Let $U$ be a locally attractive neighbourhood (Definition \ref{def:LocAttract}) of the loss manifold $\Gamma$ satisfying Assumption \ref{ass:LossMfd}. Then there exists $\beta>0$ such that for any $W(0)\in U$ there exists $C>0$ with […]. In particular, the constant $C>0$ can be chosen uniformly over $W(0)\in K \cap U$ for any compact set $K$.

Figures (4)

Figure 1: Noisy gradient descent may continue to move after reaching the zero-loss set $\Gamma$. The left-hand panel shows the level curves of a function $L:{\mathbb R}^2\to[0,\infty)$, with the zero-level set $\Gamma$ marked in red. The middle panel shows a gradient-descent evolution, starting at the top, and converging to $\Gamma$. The right-hand panel shows an evolution of the noisy gradient descent \ref{eq:NGD-intro} with $\hat{L}(w,\eta) := L(w+\eta)$. See Appendix \ref{app:details-num-sim} for more details.
Figure 2: This figure shows the evolution of Fig. \ref{fig:ex1-intro:sub:NGD-nondeg} in more detail, and compares it to the solution of the constrained gradient flow. Panel \ref{fig:ex2-intro:in-contour} is the same as Fig. \ref{fig:ex1-intro:sub:NGD-nondeg}; panels \ref{fig:ex2-intro:short-time} and \ref{fig:ex2-intro:long-time} show the vertical coordinate over different periods of time (iterations). Panels \ref{fig:ex2-intro:short-time} and \ref{fig:ex2-intro:long-time} clearly illustrate the difference in time scales between the fast evolution towards $\Gamma$ and the much slower evolution along $\Gamma$. The green circle and lines mark the minimizer of $\Delta_\eta \hat{L}(w,0)$, which in this case equals $\Delta_w L(w)$. The orange curves are the solution of the constrained gradient flow \ref{eq:fthA:GF}, started from the first point at which $w_k$ came close to $\Gamma$. More details are given in Appendix \ref{app:details-num-sim}.
Figure 3: Non-degenerate (top row) and degenerate evolution (bottom row); note how the evolution is much faster in the non-degenerate top row than in the degenerate bottom row. In both cases $\alpha = 0.1$ and $\eta$ is a scalar centered normal random variable with variance $\sigma^2=0.01$. In the top row, $\hat{L}(w,\eta) = L(w) + \frac{1}{2} a(w)\eta^2$ with $a(w) = 1-0.7 \cos 2w_1$. For this case Theorem \ref{fth:main1} applies and gives the regularizer as $\frac{1}{2} \Delta_\eta \hat{L}(w,0) = \frac{1}{2} a(w)$. In the bottom row $\hat{L}(w,\eta) = L(w) + \frac{1}{2} |w|^2 \eta$, and Theorem \ref{fth:main2} applies, with the limiting evolution \ref{eq:fthB:SDE} reducing to the deterministic constrained gradient flow $\partial_t W = -P_\Gamma \nabla_w\frac{1}{4} \log \Delta_w L(W)$. The green circles and lines mark the minimizers of the respective regularizers. More details are given in Appendix \ref{app:details-num-sim}.
Figure 4: When the coordinates of the vector $\eta_k$ are correlated, the resulting regularizer $\mathrm{Reg}$ is modified, and the evolution follows a different path along $\Gamma$. In both diagrams each $\eta_k$ is a centered normal two-dimensional random variable with covariance $C$, independent for each $k$; in the left-hand picture $C=\sigma^2 I_2$, implying independence of $\eta_{k,1}$ from $\eta_{k,2}$, while in the right-hand picture $C=\tfrac{1}{2}\sigma^2\begin{pmatrix}1 & 1\\ 1 & 1\end{pmatrix}$, implying that $\eta_{k,1}$ and $\eta_{k,2}$ are fully correlated.
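
The effect of correlated noise described for Figure 4 can be made plausible with a short formal computation. This is a heuristic sketch under assumptions made here (centered noise with covariance $C$, smooth $\hat{L}$, and the shorthand $A(w) := \partial_\eta^2\hat{L}(w,0)$, which is not notation from the paper); it is not the paper's statement of the modified regularizer.

```latex
\begin{align*}
\tfrac12\,\mathbb{E}\bigl[\eta^\top A(w)\,\eta\bigr]
  &= \tfrac12\,\operatorname{tr}\bigl(C\,A(w)\bigr), \\
C = \sigma^2 I_2:\qquad
\tfrac12\,\operatorname{tr}\bigl(C\,A(w)\bigr)
  &= \tfrac12\,\sigma^2\,\Delta_\eta \hat{L}(w,0) = \sigma^2\,\mathrm{Reg}(w), \\
C = \tfrac12\sigma^2\begin{pmatrix}1 & 1\\ 1 & 1\end{pmatrix}:\qquad
\tfrac12\,\operatorname{tr}\bigl(C\,A(w)\bigr)
  &= \tfrac14\,\sigma^2 \sum_{i,j=1}^{2} \partial_{\eta_i}\partial_{\eta_j}\hat{L}(w,0).
\end{align*}
```

In the fully correlated case the mixed derivatives $\partial_{\eta_1}\partial_{\eta_2}\hat{L}(w,0)$ enter the second-order term, which is one way to see why the two panels can follow different paths along $\Gamma$.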

Theorems & Definitions (64)

Remark 1.1: Projection onto the tangent plane
Example 1.2: The example in Figures \ref{fig:ex1-intro} and \ref{fig:ex2-intro}
Example 1.3: Bernoulli DropConnect
Remark 1.4
Remark 1.5: Time scale
Remark 1.6: Limiting equation
Example 1.7: Minibatch noise
Example 1.8: Label noise
Definition 3.1: Noisy gradient descent
Remark 3.3: Examples of noise distributions
...and 54 more
