Singular-limit analysis of gradient descent with noise injection

Anna Shalova, André Schlichting, Mark Peletier

TL;DR

This work develops a unified, rigorous framework for the limiting dynamics of noisy gradient descent in overparameterized settings where the zero-loss set ${\Gamma}$ forms a manifold. By embedding noise in a general loss $\hat{L}$ with zero-noise consistency $\hat{L}(w,0)=L(w)$ and applying Katzenberger’s theorem, it derives a two-regime description: a fast phase toward ${\Gamma}$ and a slow, either deterministic (non-degenerate) or stochastic (degenerate) evolution along ${\Gamma}$ driven by a regularizer $\mathrm{Reg}(w)=\tfrac12\Delta_\eta\hat{L}(w,0)$. The main results (non-degenerate and degenerate cases) yield explicit limiting dynamics: a constrained gradient flow $\partial_t W = -P_{T_W\Gamma}\nabla_w\mathrm{Reg}(W)$ on ${\Gamma}$, or a constrained SDE on ${\Gamma}$, with time scales $1/(\alpha\sigma^2)$ and $1/(\alpha^2\sigma^2)$ respectively. The framework is illustrated through diverse noise mechanisms (DropConnect, Dropout, minibatching, label noise, SGLD, anti-correlated perturbed GD) and shows, for example, that minibatch SGD may have vanishing regularization on the fast scales, while Dropout/DropConnect induce non-local curvature-based regularization along the zero-loss manifold. These results connect noise structure to implicit regularization and generalization, and offer a precise, generalizable lens for understanding training dynamics in overparameterized models beyond neural networks.

Abstract

We study the limiting dynamics of a large class of noisy gradient descent systems in the overparameterized regime. In this regime the set of global minimizers of the loss is large, and when initialized in a neighbourhood of this zero-loss set a noisy gradient descent algorithm slowly evolves along this set. In some cases this slow evolution has been related to better generalisation properties. We characterize this evolution for the broad class of noisy gradient descent systems in the limit of small step size. Our results show that the structure of the noise affects not just the form of the limiting process, but also the time scale at which the evolution takes place. We apply the theory to Dropout, label noise and classical SGD (minibatching) noise, and show that these evolve on two different time scales. Classical SGD even yields a trivial evolution on both time scales, implying that additional noise is required for regularization. The results are inspired by the training of neural networks, but the theorems apply to noisy gradient descent of any loss that has a non-trivial zero-loss set.
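
The two-phase behaviour described above can be sketched numerically. The following is a minimal illustration, not the authors' experiment: it assumes the iteration $w_{k+1} = w_k - \alpha \nabla_w \hat{L}(w_k,\eta_k)$ with $\hat{L}(w,\eta) = L(w+\eta)$, i.i.d. centered Gaussian noise $\eta_k$, and a hypothetical toy loss $L(w) = \tfrac12 (w_1 w_2 - 1)^2$ whose zero-loss set is the curve $w_1 w_2 = 1$. The function names and parameter values are illustrative choices.

```python
import numpy as np

def loss(w):
    """Toy loss (hypothetical): zero exactly on the curve w[0] * w[1] = 1."""
    return 0.5 * (w[0] * w[1] - 1.0) ** 2

def grad_loss(w):
    """Gradient of the toy loss with respect to w."""
    residual = w[0] * w[1] - 1.0
    return residual * np.array([w[1], w[0]])

def noisy_gd(w0, alpha, sigma, steps, seed=0):
    """Iterate w <- w - alpha * grad_w Lhat(w, eta), assuming Lhat(w, eta) = L(w + eta)."""
    rng = np.random.default_rng(seed)
    # Assumption: i.i.d. centered Gaussian noise with variance sigma**2 per coordinate.
    etas = sigma * rng.standard_normal((steps, 2))
    w = np.array(w0, dtype=float)
    for eta in etas:
        w = w - alpha * grad_loss(w + eta)
    return w

# Short run: enough steps for the fast phase towards the zero-loss set.
w_fast = noisy_gd([2.0, 1.0], alpha=0.1, sigma=0.05, steps=200)
# Long run: many multiples of 1 / (alpha * sigma**2) = 4000 steps, for the slow phase.
w_slow = noisy_gd([2.0, 1.0], alpha=0.1, sigma=0.05, steps=40_000)

print("after 200 steps:  ", w_fast, "loss:", loss(w_fast))
print("after 40000 steps:", w_slow, "loss:", loss(w_slow))
```

For this toy loss $\Delta_w L(w) = w_1^2 + w_2^2$, which on the branch $w_1 w_2 = 1$, $w_1 > 0$ is smallest at $(1,1)$. Under the stated assumptions, the short run should end close to the zero-loss curve but still far from $(1,1)$, while the long run should have drifted along the curve to a neighbourhood of $(1,1)$.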

Paper Structure (40 sections, 16 theorems, 211 equations, 4 figures, 1 table)

This paper contains 40 sections, 16 theorems, 211 equations, 4 figures, 1 table.

Key Result

Proposition 3.9

Let $U$ be a locally attractive neighbourhood (Definition \ref{def:LocAttract}) of the loss manifold $\Gamma$ satisfying Assumption \ref{ass:LossMfd}. Then there exists $\beta>0$ such that for any $W(0)\in U$ there exists $C>0$ with [...]. In particular, the constant $C>0$ can be chosen uniformly over $W(0)\in K \cap U$ for any compact set $K$.

Figures (4)

  • Figure 1: Noisy gradient descent may continue to move after reaching the zero-loss set $\Gamma$. The left-hand panel shows the level curves of a function $L:{\mathbb R}^2\to[0,\infty)$, with the zero-level set $\Gamma$ marked in red. The middle panel shows a gradient-descent evolution, starting at the top, and converging to $\Gamma$. The right-hand panel shows an evolution of the noisy gradient descent \ref{eq:NGD-intro} with $\hat{L}(w,\eta) := L(w+\eta)$. See Appendix \ref{app:details-num-sim} for more details.
  • Figure 2: This figure shows the evolution of Fig. \ref{fig:ex1-intro:sub:NGD-nondeg} in more detail, and compares it to the solution of the constrained gradient flow. Panel \ref{fig:ex2-intro:in-contour} is the same as Fig. \ref{fig:ex1-intro:sub:NGD-nondeg}; panels \ref{fig:ex2-intro:short-time} and \ref{fig:ex2-intro:long-time} show the vertical coordinate over different periods of time (iterations). Panels \ref{fig:ex2-intro:short-time} and \ref{fig:ex2-intro:long-time} clearly illustrate the difference in time scales between the fast evolution towards $\Gamma$ and the much slower evolution along $\Gamma$. The green circle and lines mark the minimizer of $\Delta_\eta \hat{L}(w,0)$, which in this case equals $\Delta_w L(w)$. The orange curves are the solution of the constrained gradient flow \ref{eq:fthA:GF}, started from the first point at which $w_k$ came close to $\Gamma$. More details are given in Appendix \ref{app:details-num-sim}.
  • Figure 3: Non-degenerate (top row) and degenerate evolution (bottom row); note how the evolution is much faster in the non-degenerate top row than in the degenerate bottom row. In both cases $\alpha = 0.1$ and $\eta$ is a scalar centered normal random variable with variance $\sigma^2=0.01$. In the top row, $\hat{L}(w,\eta) = L(w) + \frac{1}{2} a(w)\eta^2$ with $a(w) = 1-0.7 \cos 2w_1$. For this case Theorem \ref{fth:main1} applies and gives the regularizer as $\frac{1}{2} \Delta_\eta \hat{L}(w,0) = \frac{1}{2} a(w)$. In the bottom row $\hat{L}(w,\eta) = L(w) + \frac{1}{2} |w|^2 \eta$, and Theorem \ref{fth:main2} applies, with the limiting evolution \ref{eq:fthB:SDE} reducing to the deterministic constrained gradient flow $\partial_t W = -P_\Gamma \nabla_w\frac{1}{4} \log \Delta_w L(W)$. The green circles and lines mark the minimizers of the respective regularizers. More details are given in Appendix \ref{app:details-num-sim}.
  • Figure 4: When the coordinates of the vector $\eta_k$ are correlated, the resulting regularizer $\mathrm{Reg}$ is modified, and the evolution follows a different path along $\Gamma$. In both diagrams each $\eta_k$ is a centered normal two-dimensional random variable with covariance $C$, independent for each $k$; in the left-hand picture $C=\sigma^2 I_2$, implying independence of $\eta_{k,1}$ from $\eta_{k,2}$, while in the right-hand picture $C=\tfrac{1}{2}\sigma^2\begin{pmatrix}1 & 1\\ 1 & 1\end{pmatrix}$, implying that $\eta_{k,1}$ and $\eta_{k,2}$ are fully correlated.

Theorems & Definitions (64)

  • Remark 1.1: Projection onto the tangent plane
  • Example 1.2: The example in Figures \ref{fig:ex1-intro} and \ref{fig:ex2-intro}
  • Example 1.3: Bernoulli DropConnect
  • Remark 1.4
  • Remark 1.5: Time scale
  • Remark 1.6: Limiting equation
  • Example 1.7: Minibatch noise
  • Example 1.8: Label noise
  • Definition 3.1: Noisy gradient descent
  • Remark 3.3: Examples of noise distributions
  • ...and 54 more