A Theoretical Analysis of Noise Geometry in Stochastic Gradient Descent

Mingze Wang, Lei Wu

TL;DR

This work provides a theoretical framework for the geometry of SGD noise by introducing two metrics, loss alignment μ(θ) and directional alignment g(θ,v), to quantify how stochastic gradient noise aligns with the local landscape. It proves alignment for (over-parameterized) linear models and two-layer networks under sample-size conditions independent of the degree of over-parameterization, and shows that directional alignment holds across directions with comparable guarantees. The paper also analyzes SGD’s escape from sharp minima, showing that SGD escapes preferentially along flat directions and illustrating how cyclical learning rates can leverage this property to reach flatter regions, with supporting experiments on both small and large-scale models. Overall, the results offer a quantitative, model-agnostic view of SGD noise geometry and its role in optimization dynamics and implicit regularization. Key implications include a refined understanding of SGD’s implicit bias toward flat minima and guidance for learning-rate schedules that enhance exploration of flatter regions in non-convex landscapes.

Abstract

In this paper, we provide a theoretical study of noise geometry for minibatch stochastic gradient descent (SGD), a phenomenon where noise aligns favorably with the geometry of the local landscape. We propose two metrics, derived from analyzing how noise influences the loss and subspace projection dynamics, to quantify the alignment strength. We show that for (over-parameterized) linear models and two-layer nonlinear networks, when measured by these metrics, the alignment can be provably guaranteed under conditions independent of the degree of over-parameterization. To showcase the utility of our noise geometry characterizations, we present a refined analysis of the mechanism by which SGD escapes from sharp minima. We reveal that, unlike gradient descent (GD), which escapes along the sharpest directions, SGD tends to escape along flatter directions, and cyclical learning rates can exploit this SGD characteristic to navigate more effectively towards flatter regions. Lastly, extensive experiments are provided to support our theoretical findings.
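
To make the two metrics concrete, here is a minimal NumPy sketch for a linear model with square loss. The formulas are assumptions, since this page does not state them: the sketch takes the loss alignment to be μ(θ) = Tr(Σ(θ)G(θ)) / (2L(θ)‖G(θ)‖_F²) and the directional alignment to be g(θ,v) = vᵀΣ(θ)v / (2L(θ) vᵀG(θ)v), where Σ is the batch-size-1 gradient noise covariance and G is the Gram matrix of the inputs. The function names, the problem sizes, and the random seed are all hypothetical.

```python
import numpy as np

def linear_noise_geometry(X, y, w):
    """Return (loss, noise covariance, Gram matrix) for a linear model with square loss."""
    n = X.shape[0]
    r = X @ w - y                          # residuals, shape (n,)
    loss = 0.5 * np.mean(r ** 2)           # L(w) = (1/2n) * sum_i r_i^2
    per_sample = r[:, None] * X            # per-sample gradients r_i * x_i, shape (n, d)
    grad = per_sample.mean(axis=0)         # full-batch gradient
    # Covariance of a single-sample stochastic gradient (assumed noise model).
    sigma = per_sample.T @ per_sample / n - np.outer(grad, grad)
    G = X.T @ X / n                        # Gram matrix; the Hessian for a linear model
    return loss, sigma, G

def loss_alignment(loss, sigma, G):
    """Assumed definition: Tr(Sigma G) / (2 L ||G||_F^2)."""
    return np.trace(sigma @ G) / (2.0 * loss * np.linalg.norm(G, "fro") ** 2)

def directional_alignment(loss, sigma, G, v):
    """Assumed definition: v^T Sigma v / (2 L v^T G v)."""
    return (v @ sigma @ v) / (2.0 * loss * (v @ G @ v))

# Hypothetical toy problem: isotropic Gaussian inputs, noiseless linear targets.
rng = np.random.default_rng(0)
n, d = 2000, 10
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star
w = rng.standard_normal(d)                 # a randomly chosen parameter vector

loss, sigma, G = linear_noise_geometry(X, y, w)
mu = loss_alignment(loss, sigma, G)
g_v = directional_alignment(loss, sigma, G, np.eye(d)[0])
print("loss alignment:", mu)
print("directional alignment along e_1:", g_v)
```

Under these assumed definitions, both quantities compare the noise energy against the loss-scaled curvature, so a value bounded away from zero indicates that the noise does not vanish along curved directions.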

Paper Structure (35 sections, 18 theorems, 125 equations, 6 figures)

This paper contains 35 sections, 18 theorems, 125 equations, 6 figures.

Key Result

Theorem 3.2

Consider OLMs and suppose the paper's input assumption holds. For any $\epsilon,\delta\in(0,1)$, if $n\gtrsim\max\{(d^2\log^2(1/\epsilon)+\log^2(1/\delta))/\epsilon,\ (d\log(1/\epsilon)+\log(1/\delta))/\epsilon^2\}$, then with probability at least $1-\delta$, it holds for any $\th
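
The sample-size condition above can be evaluated numerically. The sketch below computes the right-hand side of the condition with the constant hidden by $\gtrsim$ set to 1, which is an assumption, so the result should be read as a scaling guide and not as an exact threshold. The function name `olm_sample_size_bound` is hypothetical.

```python
import math

def olm_sample_size_bound(d, eps, delta):
    """Right-hand side of the sample-size condition, with the hidden constant assumed to be 1."""
    if not (0.0 < eps < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("eps and delta must lie in (0, 1)")
    log_eps = math.log(1.0 / eps)
    log_delta = math.log(1.0 / delta)
    # First term: (d^2 log^2(1/eps) + log^2(1/delta)) / eps
    term_a = (d ** 2 * log_eps ** 2 + log_delta ** 2) / eps
    # Second term: (d log(1/eps) + log(1/delta)) / eps^2
    term_b = (d * log_eps + log_delta) / eps ** 2
    return max(term_a, term_b)

bound = olm_sample_size_bound(d=10, eps=0.1, delta=0.1)
print(f"n should exceed roughly {bound:.0f} (up to a constant factor)")
```

The bound depends only on the input dimension $d$, the accuracy $\epsilon$, and the failure probability $\delta$, which is consistent with the claim that the condition is independent of the degree of over-parameterization.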

Figures (6)

  • Figure 1: The alignment strength $\mu(\theta)$ is close to $1$ for various models across different model sizes. For all experiments, we set $n=5\log(d_{\mathrm{eff}}), d_{\mathrm{eff}}=50$. The input data are drawn from $\mathcal{N}(0,S)$. For isotropic data, $S=I_{50}$; for anisotropic data, $S={\rm diag}(\lambda_1,\dots,\lambda_D)$ with $\lambda_k=1/\sqrt{k}$ for $k\in [D]$, where $D$ is chosen such that $d_{\mathrm{eff}}=50$. The error bars correspond to the standard deviation over $20$ independent runs. The targets are generated by a linear model, i.e., $y_i=\langle w^*,x_i\rangle$, where $w^*\sim \mathcal{N}(0,I_d)$. We compute $\mu(\theta)$ for randomly chosen $\theta$'s.
  • Figure 2: How the components of noise energy along the eigen-directions, $\{\alpha_k\}_k$, are proportional to the corresponding curvatures $\{\lambda_k\}_k$. The ratio $\alpha_k/\lambda_k$ reflects the directional alignment (Definition 4.1) along the eigen-directions. (a) Linear models on Gaussian data in limited-data regimes, where we fix $d=10^4$ and set $n$ accordingly ($n=d/8$, $n=8\log d$). (b) A 4-layer CNN and a 4-layer FNN on the CIFAR-10 dataset. For more experimental details, see the appendix on experimental setups.
  • Figure 3: Comparison of escape directions between SGD and GD. The problem is linear regression, and both SGD and GD are initialized near the global minimum by $w_0\sim\mathcal{N}(w^*,e^{-10}I_d/d)$. To ensure escape, we choose $\eta=1.2/\|G\|_{\mathrm{F}}$ for SGD and $\eta=4/(\lambda_1+\lambda_2)$ for GD. See the appendix on experimental setups for more details.
  • Figure 4: Visualization of the trajectories of SGD+CLR vs. GD+CLR for our toy model. Both cases use the same CLR schedule. We observe that SGD+CLR moves significantly towards the flatter region, while GD+CLR only oscillates along the sharpest direction. We have extensively tuned the learning rates for GD+CLR but do not observe significant movement towards the flatter region in any case.
  • Figure 5: Three distributions ($\{\lambda_k\}_k$ and $\{\alpha_k\}_k$) for larger-scale neural networks, which reflect the directional alignment (Definition 4.1) along the eigen-directions of the local landscape.
  • ...and 1 more figure
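
The GD learning rate in the Figure 3 caption, $\eta=4/(\lambda_1+\lambda_2)$, can be checked on a toy quadratic. The sketch below is an illustration under stated assumptions, not the paper's experiment: it replaces linear regression with a diagonal quadratic whose minimum is at the origin, and the eigenvalues, initialization scale, and step count are hypothetical. On such a quadratic, GD multiplies the $k$-th coordinate by $1-\eta\lambda_k$ at every step, so with this $\eta$ only the sharpest direction has a factor of magnitude larger than 1. The SGD side of the comparison is not simulated here.

```python
import numpy as np

def gd_on_quadratic(eigvals, w0, lr, steps):
    """Run GD on L(w) = 0.5 * sum_k eigvals[k] * w[k]**2 and return all iterates."""
    w = np.array(w0, dtype=float)
    traj = [w.copy()]
    for _ in range(steps):
        w = w - lr * eigvals * w           # gradient of the quadratic is eigvals * w
        traj.append(w.copy())
    return np.array(traj)

# Hypothetical curvature spectrum, sorted so that eigvals[0] is the sharpest direction.
eigvals = np.array([4.0, 2.0, 1.0, 0.5])
lr = 4.0 / (eigvals[0] + eigvals[1])       # learning rate from the Figure 3 caption
w0 = np.full(eigvals.shape, 1e-3)          # start close to the minimum at the origin

traj = gd_on_quadratic(eigvals, w0, lr, steps=50)
factors = np.abs(1.0 - lr * eigvals)       # per-step contraction or expansion factors
growth = np.abs(traj[-1]) / np.abs(traj[0])
print("per-step factors:", factors)
print("escape direction index:", int(np.argmax(np.abs(traj[-1]))))
```

For any $\lambda_1>\lambda_2>0$, this choice of $\eta$ gives $|1-\eta\lambda_1|=(3\lambda_1-\lambda_2)/(\lambda_1+\lambda_2)>1$ and $|1-\eta\lambda_k|<1$ for every flatter direction, which matches the statement that GD escapes along the sharpest direction.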

Theorems & Definitions (25)

  • Definition 3.1: Loss alignment
  • Theorem 3.2: OLM
  • Theorem 3.3: Linear model
  • Theorem 3.5: Two-layer network
  • Definition 4.1: Directional alignment
  • Theorem 4.2: OLM
  • Theorem 4.3: Linear model
  • Theorem 5.2: Escape of SGD
  • Proposition 5.3: Escape of GD
  • Lemma C.1: Proposition 2.3 in wu2022does
  • ...and 15 more