Singular-limit analysis of gradient descent with noise injection
Anna Shalova, André Schlichting, Mark Peletier
TL;DR
This work develops a unified, rigorous framework for the limiting dynamics of noisy gradient descent in overparameterized settings where the zero-loss set ${\Gamma}$ forms a manifold. By embedding the noise in a general loss $\hat{L}$ that satisfies the zero-noise consistency condition $\hat{L}(w,0)=L(w)$, and by applying Katzenberger’s theorem, it derives a two-phase description: a fast phase of convergence toward ${\Gamma}$, followed by a slow evolution along ${\Gamma}$ driven by a regularizer $\mathrm{Reg}(w)=\tfrac12\Delta_\eta\hat{L}(w,0)$. The slow evolution is deterministic in the non-degenerate case and stochastic in the degenerate case. The two main results give the limiting dynamics explicitly: a constrained gradient flow $\partial_t W = -P_{T_W\Gamma}\nabla_w\mathrm{Reg}(W)$ on ${\Gamma}$ in the non-degenerate case, and a constrained SDE on ${\Gamma}$ in the degenerate case, on time scales $1/(\alpha\sigma^2)$ and $1/(\alpha^2\sigma^2)$ respectively. The framework is illustrated through diverse noise mechanisms (DropConnect, Dropout, minibatching, label noise, SGLD, anti-correlated perturbed GD) and shows, for example, that minibatch SGD noise alone yields a trivial evolution on both time scales, while Dropout and DropConnect induce non-local curvature-based regularization along the zero-loss manifold. These results connect noise structure to implicit regularization and generalization, and offer a precise, generalizable lens for understanding training dynamics in overparameterized models beyond neural networks.
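As a rough illustration of the regularizer $\mathrm{Reg}(w)=\tfrac12\Delta_\eta\hat{L}(w,0)$ and the constrained gradient flow on ${\Gamma}$, the sketch below works through a two-parameter toy problem. Everything in it is an assumption made for illustration and is not taken from the paper: the loss $L(w)=\tfrac12(w_1w_2-1)^2$, the additive noise model $\hat{L}(w,\eta)=L(w+\eta)$, the finite-difference approximations, the Euler discretization with a rescaling step back onto ${\Gamma}$, and all function names. Time is measured in the rescaled slow variable, so the step size and noise level do not appear.

```python
import math

def loss(w):
    # Toy overparameterized loss; its zero-loss set is Gamma = {w1 * w2 = 1}.
    return 0.5 * (w[0] * w[1] - 1.0) ** 2

def noisy_loss(w, eta):
    # Assumed noise model (illustrative only): hat L(w, eta) = L(w + eta),
    # which satisfies hat L(w, 0) = L(w).
    return loss([w[0] + eta[0], w[1] + eta[1]])

def reg(w, h=1e-2):
    # Reg(w) = 0.5 * Laplacian in eta of hat L(w, eta) at eta = 0,
    # approximated with central second differences in each eta coordinate.
    base = noisy_loss(w, [0.0, 0.0])
    lap = 0.0
    for i in range(2):
        step = [0.0, 0.0]
        step[i] = h
        plus = noisy_loss(w, step)
        step[i] = -h
        minus = noisy_loss(w, step)
        lap += (plus - 2.0 * base + minus) / h**2
    return 0.5 * lap

def grad_reg(w, h=1e-4):
    # Central-difference gradient of Reg with respect to w.
    g = []
    for i in range(2):
        up = list(w)
        up[i] += h
        dn = list(w)
        dn[i] -= h
        g.append((reg(up) - reg(dn)) / (2.0 * h))
    return g

def slow_flow(w0, dt=0.01, steps=2000):
    # Euler steps for dW/dt = -P grad Reg(W), where P projects onto the
    # tangent space of Gamma; each step is followed by a rescaling that
    # puts the iterate back on Gamma.
    w = list(w0)
    for _ in range(steps):
        g = grad_reg(w)
        n = [w[1], w[0]]  # normal to Gamma: gradient of w1 * w2
        coef = (g[0] * n[0] + g[1] * n[1]) / (n[0] ** 2 + n[1] ** 2)
        v = [g[0] - coef * n[0], g[1] - coef * n[1]]  # tangential part of grad Reg
        w = [w[0] - dt * v[0], w[1] - dt * v[1]]
        scale = math.sqrt(w[0] * w[1])
        w = [w[0] / scale, w[1] / scale]  # retraction: enforce w1 * w2 = 1
    return w

w_start = [3.0, 1.0 / 3.0]   # a point on Gamma
w_end = slow_flow(w_start)
print(reg(w_start), reg(w_end), w_end)
```

Under these assumptions $\mathrm{Reg}(w)=\tfrac12(w_1^2+w_2^2)$, so the flow stays on ${\Gamma}$ while moving from the unbalanced start toward the balanced minimizer near $(1,1)$, where the regularizer is smallest. A different noise model would give a different regularizer and possibly no motion at all along ${\Gamma}$.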
Abstract
We study the limiting dynamics of a large class of noisy gradient descent systems in the overparameterized regime. In this regime the set of global minimizers of the loss is large, and when initialized in a neighbourhood of this zero-loss set, a noisy gradient descent algorithm slowly evolves along this set. In some cases this slow evolution has been related to better generalisation properties. We characterize this evolution for this broad class of noisy gradient descent systems in the limit of small step size. Our results show that the structure of the noise affects not just the form of the limiting process, but also the time scale at which the evolution takes place. We apply the theory to Dropout, label noise and classical SGD (minibatching) noise, and show that these evolve on two different time scales. Classical SGD even yields a trivial evolution on both time scales, implying that additional noise is required for regularization. The results are inspired by the training of neural networks, but the theorems apply to noisy gradient descent of any loss that has a non-trivial zero-loss set.
