On the Superlinear Relationship between SGD Noise Covariance and Loss Landscape Curvature
Yikuan Zhang, Ning Yang, Yuhai Tu
TL;DR
This work resolves a long-standing tension in understanding SGD noise by arguing that the Fisher-based linear link between the noise covariance $\mathbf{C}$ and the Hessian $\mathbf{H}$ is generally invalid in deep learning. Using Activity–Weight Duality (AWD), the authors establish a loss-agnostic relation $\mathbf{C} \propto \mathbb{E}_p[\mathbf{h}_p^2]$, where $\mathbf{h}_p$ is the per-sample Hessian and $\mathbf{H} = \mathbb{E}_p[\mathbf{h}_p]$. They show that $\mathbf{C}$ and $\mathbf{H}$ approximately commute, with diagonal elements obeying a power law $C_{ii} \propto H_{ii}^{\gamma}$ whose exponent satisfies $1 \le \gamma \le 2$ and is determined by the per-sample Hessian spectra. Empirically, CE loss yields $\gamma>1$ (superlinear scaling) while MSE is near linear, and AWD captures these exponents with good accuracy; the differences are traced to a correlation between the leading per-sample curvature and its alignment with the global geometry. The results provide a unifying, geometry-aware description of SGD noise, explain why SGD regularizes toward flatter regions, and are supported by extensive experiments across datasets, architectures, and loss functions. The framework is automatic and model-agnostic, offering practical insights into optimization and generalization in deep learning.
Abstract
Stochastic Gradient Descent (SGD) introduces anisotropic noise that is correlated with the local curvature of the loss landscape, thereby biasing optimization toward flat minima. Prior work often assumes an equivalence between the Fisher Information Matrix and the Hessian for negative log-likelihood losses, leading to the claim that the SGD noise covariance $\mathbf{C}$ is proportional to the Hessian $\mathbf{H}$. We show that this assumption holds only under restrictive conditions that are typically violated in deep neural networks. Using the recently discovered Activity--Weight Duality, we find a more general relationship agnostic to the specific loss formulation, showing that $\mathbf{C} \propto \mathbb{E}_p[\mathbf{h}_p^2]$, where $\mathbf{h}_p$ denotes the per-sample Hessian with $\mathbf{H} = \mathbb{E}_p[\mathbf{h}_p]$. As a consequence, $\mathbf{C}$ and $\mathbf{H}$ commute approximately rather than coincide exactly, and their diagonal elements follow an approximate power-law relation $C_{ii} \propto H_{ii}^{\gamma}$ with a theoretically bounded exponent $1 \leq \gamma \leq 2$, determined by per-sample Hessian spectra. Experiments across datasets, architectures, and loss functions validate these bounds, providing a unified characterization of the noise-curvature relationship in deep learning.
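
As an illustration of how the exponent range $1 \le \gamma \le 2$ can arise from $\mathbf{C} \propto \mathbb{E}_p[\mathbf{h}_p^2]$, the following is a minimal toy sketch and not the authors' derivation or experimental code. It assumes per-sample Hessians that are diagonal (so they commute exactly), uses $\mathbb{E}_p[h_{p,ii}^2]$ as a stand-in for $C_{ii}$, and fits $\gamma$ as the slope of $\log C_{ii}$ against $\log H_{ii}$. The three synthetic regimes, the helper name `fit_exponent`, and all numerical settings are hypothetical choices made for illustration only.

```python
import numpy as np

def fit_exponent(h):
    """Fit gamma in C_ii ~ H_ii**gamma for toy diagonal per-sample Hessians.

    h has shape (n_samples, n_dims); h[p, i] plays the role of the i-th
    diagonal entry of the per-sample Hessian h_p (a toy assumption).
    """
    H_diag = h.mean(axis=0)          # H_ii = E_p[h_p,ii]
    C_proxy = (h ** 2).mean(axis=0)  # stand-in for C_ii, taken as E_p[h_p,ii^2]
    slope, _ = np.polyfit(np.log(H_diag), np.log(C_proxy), 1)
    return slope

rng = np.random.default_rng(0)
n_samples, n_dims = 20000, 50

# Regime A: every sample contributes to every direction, and directions
# differ only by an overall curvature scale s_i. Then E[h^2] ~ s_i^2 and
# E[h] ~ s_i, so the fitted exponent is 2.
scales = np.logspace(-2, 0, n_dims)
x = rng.uniform(0.5, 1.5, size=(n_samples, 1))
gamma_scale = fit_exponent(x * scales)

# Regime B: each direction is "active" for only a fraction q_i of samples,
# with a fixed curvature a when active. Then E[h^2] = a * E[h], so the
# fitted exponent is 1.
a = 3.0
q = np.logspace(-2, np.log10(0.5), n_dims)
mask = rng.random((n_samples, n_dims)) < q
gamma_sparse = fit_exponent(a * mask)

# Regime C: both the active fraction and the active curvature grow together
# across directions, giving an intermediate exponent (about 1.5 here).
t = np.logspace(-1, 0, n_dims)
mask_c = rng.random((n_samples, n_dims)) < 0.5 * t
gamma_mixed = fit_exponent(t * mask_c)

print(f"scale-dominated: gamma = {gamma_scale:.3f}")
print(f"sparsity-dominated: gamma = {gamma_sparse:.3f}")
print(f"mixed: gamma = {gamma_mixed:.3f}")
```

In this toy setting the two extremes correspond to curvature differences driven purely by per-sample magnitude (exponent 2) or purely by how many samples contribute (exponent 1); how real per-sample Hessian spectra place a network between these limits is the subject of the paper itself.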
