What Happens after SGD Reaches Zero Loss? --A Mathematical Framework
Zhiyuan Li, Tianhao Wang, Sanjeev Arora
TL;DR
This work develops a global diffusion framework for analyzing the implicit bias of SGD in heavily overparametrized models by projecting the dynamics onto the manifold $\Gamma$ of local minimizers. Building on Katzenberger's theory, it derives a limiting SDE that is valid for $\Theta(\eta^{-2})$ steps and allows arbitrary noise covariance, revealing how tangent noise and Itô corrections shape the slow evolution along $\Gamma$ through tangent regularization and mixed/normal regularization terms. In particular, label-noise SGD is shown to induce a regularizer-driven flow on $\Gamma$ that can recover a $\kappa$-sparse ground truth in overparametrized linear models with $O(\kappa \ln d)$ samples, whereas gradient descent initialized in the kernel regime requires $\Omega(d)$ samples. The framework thus explains how stochasticity can promote generalization by guiding optimization toward low-complexity solutions, and it offers a principled route to analyzing other stochastic optimizers through similar diffusion limits.
Abstract
Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning, especially for overparametrized models, where the local minimizers of the loss function $L$ can form a manifold. Intuitively, with a sufficiently small learning rate $\eta$, SGD tracks Gradient Descent (GD) until it gets close to such a manifold, where the gradient noise prevents further convergence. In this regime, Blanc et al. (2020) proved that SGD with label noise locally decreases a regularizer-like term, the sharpness of the loss, $\mathrm{tr}[\nabla^2 L]$. The current paper gives a general framework for such analysis by adapting ideas from Katzenberger (1991). In principle, it allows a complete characterization of the regularization effect of SGD around such a manifold (i.e., the "implicit bias") using a stochastic differential equation (SDE) that describes the limiting dynamics of the parameters and is determined jointly by the loss function and the noise covariance. This yields some new results: (1) a global analysis of the implicit bias valid for $\eta^{-2}$ steps, in contrast to the local analysis of Blanc et al. (2020), which is only valid for $\eta^{-1.6}$ steps, and (2) support for arbitrary noise covariance. As an application, we show that with arbitrarily large initialization, label noise SGD can always escape the kernel regime and requires only $O(\kappa\ln d)$ samples to learn a $\kappa$-sparse overparametrized linear model in $\mathbb{R}^d$ (Woodworth et al., 2020), while GD initialized in the kernel regime requires $\Omega(d)$ samples. This upper bound is minimax optimal and improves on the previous $\tilde{O}(\kappa^2)$ upper bound (HaoChen et al., 2020).
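
To make the label noise SGD setting concrete, the following is a minimal sketch and not the authors' code. It assumes the quadratic parametrization $w = u \odot u - v \odot v$ for the overparametrized linear model, a squared loss, and labels perturbed by $\pm\delta$ with equal probability at every step. The function names, the Rademacher design matrix, and all hyperparameter values are hypothetical choices made for illustration; the sketch does not attempt to reproduce the sample complexity result.

```python
import numpy as np


def predict(u, v, X):
    """Linear model with the assumed parametrization w = u*u - v*v."""
    return X @ (u * u - v * v)


def label_noise_sgd_step(u, v, x_i, y_i, lr, delta, rng):
    """One SGD step on 0.5 * (x_i . w - y_noisy)^2 with a freshly perturbed label."""
    # Label noise: add +delta or -delta with equal probability.
    y_noisy = y_i + delta * rng.choice([-1.0, 1.0])
    w = u * u - v * v
    residual = x_i @ w - y_noisy
    # Chain rule through w = u*u - v*v.
    grad_u = residual * 2.0 * u * x_i
    grad_v = -residual * 2.0 * v * x_i
    return u - lr * grad_u, v - lr * grad_v


def run(n=20, d=50, kappa=2, steps=2000, lr=1e-3, delta=0.5, init=1.0, seed=0):
    """Train on n < d samples generated by a kappa-sparse ground truth."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1.0, 1.0], size=(n, d))  # hypothetical Rademacher inputs
    w_star = np.zeros(d)
    w_star[:kappa] = 1.0
    y = X @ w_star  # clean labels; noise is injected only inside the step
    u = np.full(d, init)
    v = np.full(d, init)
    for _ in range(steps):
        i = rng.integers(n)  # sample one training example uniformly
        u, v = label_noise_sgd_step(u, v, X[i], y[i], lr, delta, rng)
    train_loss = 0.5 * np.mean((predict(u, v, X) - y) ** 2)
    return u, v, w_star, train_loss


if __name__ == "__main__":
    u, v, w_star, train_loss = run()
    print("train loss:", train_loss)
    print("distance to sparse ground truth:", np.linalg.norm(u * u - v * v - w_star))
```

Setting `delta=0.0` recovers plain SGD without label noise, which gives a simple baseline for comparing how far the learned $w$ ends up from the sparse ground truth. Because the slow drift along the manifold unfolds over roughly $\eta^{-2}$ steps, the small default `steps` value is only enough to check that the code runs, not to observe the regularization effect.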
