A Theoretical Analysis of Noise Geometry in Stochastic Gradient Descent
Mingze Wang, Lei Wu
TL;DR
This work develops a theoretical framework for the geometry of SGD noise. It introduces two metrics, loss alignment μ(θ) and directional alignment g(θ,v), that quantify how stochastic gradient noise aligns with the local loss landscape. It proves that alignment holds for (over-parameterized) linear models and two-layer networks under sample-size conditions independent of the degree of over-parameterization, and that directional alignment holds across directions with comparable guarantees. The paper also analyzes how SGD escapes from sharp minima, showing that it escapes preferentially along flat directions and that cyclical learning rates can leverage this property to reach flatter regions. Experiments on linear models and on small- and large-scale deep networks support the theory. The results give a quantitative view of SGD noise geometry and its role in optimization dynamics and implicit regularization, refine the understanding of SGD’s implicit bias toward flat minima, and suggest learning-rate schedules that improve exploration of flatter regions in non-convex landscapes.
Abstract
In this paper, we provide a theoretical study of noise geometry for minibatch stochastic gradient descent (SGD), a phenomenon where noise aligns favorably with the geometry of the local landscape. We propose two metrics, derived from analyzing how noise influences the loss and subspace projection dynamics, to quantify the alignment strength. We show that for (over-parameterized) linear models and two-layer nonlinear networks, when measured by these metrics, the alignment can be provably guaranteed under conditions independent of the degree of over-parameterization. To showcase the utility of our noise geometry characterizations, we present a refined analysis of the mechanism by which SGD escapes from sharp minima. We reveal that, unlike gradient descent (GD), which escapes along the sharpest directions, SGD tends to escape along flatter directions, and that cyclical learning rates can exploit this characteristic of SGD to navigate more effectively towards flatter regions. Lastly, extensive experiments are provided to support our theoretical findings.
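
To make the idea of alignment between noise and landscape concrete, the following is a minimal NumPy sketch for an over-parameterized linear model with squared loss. The abstract does not state the formulas for the two metrics, so the normalizations used here are assumptions: loss alignment is taken as Tr(ΣG) / (2 L ‖G‖_F²) and directional alignment as vᵀΣv / (2 L vᵀGv), where Σ is the batch-size-one gradient noise covariance, G is the Hessian, and L is the loss. The function names, data, and dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np


def noise_geometry(X, y, theta):
    """Return loss, SGD noise covariance, and Hessian for a linear model
    f(x) = theta @ x with per-sample loss 0.5 * (f(x_i) - y_i)**2."""
    n = X.shape[0]
    residuals = X @ theta - y                    # shape (n,)
    loss = 0.5 * np.mean(residuals ** 2)
    per_sample_grads = residuals[:, None] * X    # shape (n, d)
    full_grad = per_sample_grads.mean(axis=0)
    # Covariance of the single-sample stochastic gradient.
    sigma = per_sample_grads.T @ per_sample_grads / n - np.outer(full_grad, full_grad)
    hessian = X.T @ X / n
    return loss, sigma, hessian


def loss_alignment(loss, sigma, hessian):
    # Assumed normalization: Tr(Sigma G) / (2 L ||G||_F^2).
    return np.trace(sigma @ hessian) / (2.0 * loss * np.linalg.norm(hessian, "fro") ** 2)


def directional_alignment(loss, sigma, hessian, v):
    # Assumed normalization: v^T Sigma v / (2 L v^T G v).
    return (v @ sigma @ v) / (2.0 * loss * (v @ hessian @ v))


rng = np.random.default_rng(0)
n, d = 50, 200                                   # fewer samples than parameters
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
theta = rng.standard_normal(d)

loss, sigma, hessian = noise_geometry(X, y, theta)
mu = loss_alignment(loss, sigma, hessian)

# eigh returns eigenvalues in ascending order; only the last n are nonzero here.
eigvals, eigvecs = np.linalg.eigh(hessian)
v_sharp = eigvecs[:, -1]                         # sharpest direction
v_flat = eigvecs[:, -n]                          # flattest direction with nonzero curvature
g_sharp = directional_alignment(loss, sigma, hessian, v_sharp)
g_flat = directional_alignment(loss, sigma, hessian, v_flat)

print("loss alignment:", mu)
print("directional alignment (sharp, flat):", g_sharp, g_flat)
```

Under these assumed definitions, values bounded away from zero would indicate that the noise magnitude scales with the loss and the curvature, which is the kind of alignment the abstract describes. The flat direction is restricted to the row space of the data because directions with exactly zero curvature would make the directional ratio ill-defined.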
