On the Superlinear Relationship between SGD Noise Covariance and Loss Landscape Curvature

Yikuan Zhang, Ning Yang, Yuhai Tu

TL;DR

This work resolves a long-standing tension in understanding SGD noise by arguing that the Fisher-based linear link between noise covariance $\mathbf{C}$ and the Hessian $\mathbf{H}$ is generally invalid in deep learning. Using Activity–Weight Duality (AWD), the authors establish a loss-agnostic relation $\mathbf{C} \propto \mathbb{E}_p[\mathbf{h}_p^2]$, and show that $\mathbf{C}$ and $\mathbf{H}$ approximately commute with diagonal elements obeying a power-law $C_{ii} \propto H_{ii}^{\gamma}$ where $1 \le \gamma \le 2$, determined by per-sample Hessian spectra. Empirically, CE loss yields $\gamma>1$ (superlinear scaling) while MSE is near linear, and AWD captures these exponents with good accuracy; the differences are traced to a correlation between the leading per-sample curvature and its alignment with the global geometry. The results provide a unifying, geometry-aware description of SGD noise, explain why SGD regularizes toward flatter regions, and are supported by extensive experiments across datasets, architectures, and loss functions. The framework is automatic and model-agnostic, offering practical insights into optimization and generalization in deep learning.

Abstract

Stochastic Gradient Descent (SGD) introduces anisotropic noise that is correlated with the local curvature of the loss landscape, thereby biasing optimization toward flat minima. Prior work often assumes an equivalence between the Fisher Information Matrix and the Hessian for negative log-likelihood losses, leading to the claim that the SGD noise covariance $\mathbf{C}$ is proportional to the Hessian $\mathbf{H}$. We show that this assumption holds only under restrictive conditions that are typically violated in deep neural networks. Using the recently discovered Activity–Weight Duality, we find a more general relationship agnostic to the specific loss formulation, showing that $\mathbf{C} \propto \mathbb{E}_p[\mathbf{h}_p^2]$, where $\mathbf{h}_p$ denotes the per-sample Hessian with $\mathbf{H} = \mathbb{E}_p[\mathbf{h}_p]$. As a consequence, $\mathbf{C}$ and $\mathbf{H}$ commute approximately rather than coincide exactly, and their diagonal elements follow an approximate power-law relation $C_{ii} \propto H_{ii}^{\gamma}$ with a theoretically bounded exponent $1 \leq \gamma \leq 2$, determined by per-sample Hessian spectra. Experiments across datasets, architectures, and loss functions validate these bounds, providing a unified characterization of the noise-curvature relationship in deep learning.
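
To make the relation $\mathbf{C} \propto \mathbb{E}_p[\mathbf{h}_p^2]$ and the bounds $1 \leq \gamma \leq 2$ concrete, the following is a minimal numerical sketch on a toy model that is not taken from the paper. It assumes per-sample Hessians that are diagonal in a shared basis (so $\mathbf{C}$ and $\mathbf{H}$ commute exactly), sets the proportionality constant to 1, and estimates $\gamma$ as the least-squares slope of $\log C_{ii}$ against $\log H_{ii}$. The helper names `fit_gamma` and `toy_diagonals` are hypothetical.

```python
import numpy as np

def fit_gamma(H_diag, C_diag):
    """Least-squares slope of log C_ii against log H_ii."""
    slope, _ = np.polyfit(np.log(H_diag), np.log(C_diag), 1)
    return slope

def toy_diagonals(freq, amp, n_samples, rng):
    """Toy per-sample Hessians, diagonal in a shared basis (an assumption).

    Sample p has curvature amp[i] along direction i with probability freq[i]
    and zero curvature otherwise. Returns (H_ii, C_ii) with H = E_p[h_p] and
    C = E_p[h_p^2] (proportionality constant set to 1).
    """
    active = rng.random((n_samples, freq.size)) < freq
    h = active * amp
    return h.mean(axis=0), (h ** 2).mean(axis=0)

rng = np.random.default_rng(0)
d = 50
spread = np.logspace(-1.5, -0.3, d)  # values from about 0.03 to 0.5

# Regime A: directions differ only in how often a sample has curvature there.
H_a, C_a = toy_diagonals(freq=spread, amp=np.ones(d), n_samples=20000, rng=rng)

# Regime B: all directions are equally often active; only magnitudes differ.
H_b, C_b = toy_diagonals(freq=np.full(d, 0.3), amp=spread, n_samples=20000, rng=rng)

gamma_a = fit_gamma(H_a, C_a)
gamma_b = fit_gamma(H_b, C_b)
print(f"gamma, frequency-driven curvature: {gamma_a:.2f}")
print(f"gamma, magnitude-driven curvature: {gamma_b:.2f}")
```

In this toy setting the two regimes sit at the two ends of the interval: when the spread in $H_{ii}$ comes from how many samples contribute, $C_{ii}$ tracks $H_{ii}$ linearly, and when it comes from the size of each sample's curvature, $C_{ii}$ tracks $H_{ii}^2$. Mixtures of the two effects would be expected to give intermediate slopes.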

Paper Structure (55 sections, 6 theorems, 99 equations, 15 figures, 3 tables)

Key Result

Lemma 3.2

The closed-form solution to the optimization problem in Definition 3.1 is a rank-1 outer product given by:

Figures (15)

  • Figure 1: Noise–curvature alignment for a CNN trained on CIFAR-10 with cross-entropy loss (top 100 eigen-directions). (a) The empirical covariance matrix $\mathbf{C}$ represented in the Hessian eigenbasis. (b) The scale-invariant correlation matrix $\mathbf{R}_{\text{real}}$, normalized by diagonal elements. (c) The randomized baseline $\mathbf{R}_{\text{rand}}$, constructed by randomly rotating $\mathbf{C}$ while preserving its eigenvalue spectrum. See Appendix Figures \ref{fig:cmt_cifar_cnn_cse}, \ref{fig:cmt_mnist_fc_cse}, and \ref{fig:cmt_mnist_fc_mse} for more details and results on additional architectures.
  • Figure 2: Log-log plot of diagonal elements using the top $1000$ eigenvalues for models trained to convergence ($100\%$ training accuracy for CE and $>95\%$ for MSE). Data points are mean-centered and vertically shifted for visualization; solid lines denote linear fits. (a) Empirical noise covariance versus the Hessian. (b) AWD-derived noise covariance (Eq. \ref{eq:thm_result}) versus the Hessian. The dotted and dashed lines correspond to slopes $1$ and $2$, respectively.
  • Figure 3: Log-log plots of the diagonal elements of the covariance obtained in the "suppression experiment" versus those of the original covariance. (a, b) Covariance derived from per-sample Hessians retaining only the dominant eigenvalues. (c, d) Covariance derived after replacing the dominant eigenvalues with their mean value. Columns correspond to distinct models: (a, c) MLP on CIFAR-10 (CE loss, $\gamma \approx 1.4$) and (b, d) MLP on MNIST (MSE loss, $\gamma \approx 1$).
  • Figure 4: Comprehensive Analysis of SGD Noise Structure and Approximations (MLP on MNIST, CE Loss). (a) Evolution of the scaling exponent $\gamma$. The exponent $\gamma$ remains within the interval $[1, 2]$ throughout training; it gradually increases as training progresses and moves away from the lower bound 1 near the global minimum. Notably, in the terminal phase (near the global minimum, indicated by the vertical dashed line), the scaling exponent derived from the raw empirical covariance ($\mathbf{Covar}$) converges to match both the AWD-derived covariance ($\mathbf{C}_{AWD,raw}$) and its full or partial approximations ($\mathbf{C}^{hh} , \mathbf{C}^{hh,SD} , \mathbf{C}^{hh,SD,WD} , \mathbf{C}^{hh,SD,WD,LI}$; see Appendix \ref{app:awd_approx} for details), showing that the approximations used in Theorem 3.4 are valid. (b) Commutativity error between matrix pairs. The random baseline (dash-dot line) represents the expected error ($\approx 1.4$) for unrelated matrices. The significantly lower error for the covariances and the Hessian indicates that they satisfy an approximate commutation relation. (c) Eigen alignment, measured as the ratio of the diagonal magnitude to the total magnitude of $\mathbf{C}$ in the eigenbasis of $\mathbf{H}$. High ratios indicate that $\mathbf{C}$ is nearly diagonal in $\mathbf{H}$'s basis, further supporting approximate commutativity. (d) Spearman rank correlation between the diagonals of $\mathbf{C}$ and $\mathbf{H}$ in $\mathbf{H}$'s eigenbasis. A value of 1.0 indicates a strict monotonic correspondence between the noise and curvature spectra. (e) Training dynamics showing the loss and accuracy; the model converges to 100% training accuracy around epoch 30. (f) Evolution of Frobenius norms validating the gradient noise approximation; see Appendix \ref{app:awd_approx} for the definition of these variables. The dominance of the Hessian-weight term ($\mathbf{C}^{hh}$) over the gradient-activity terms ($\mathbf{C}^{hg}, \mathbf{C}^{gg}$) confirms the "Vanishing Gradients" assumption near the global minimum. The convergence of the terms ($\mathbf{C}_{AWD,raw}, \mathbf{C}^{hh} , \mathbf{C}^{hh,SD} , \mathbf{C}^{hh,SD,WD}$) validates the Independence of Distinct Samples and Local Isotropy assumptions in Theorem 3.4. (g) Diagonal magnitudes of ($\mathbf{C}^{hh}, \mathbf{C}^{hg}, \mathbf{C}^{gg}$) compared to $\mathbf{C}_{AWD,raw}$ at epoch 100 (near the global minimum) versus descending basis index, providing a detailed view of the dominance of the Hessian term ($\mathbf{C}^{hh}$). (h) Diagonal magnitudes of ($\mathbf{C}^{hh}, \mathbf{C}^{hh,SD,WD}$) compared to $\mathbf{C}^{hh,SD}$ at epoch 100 (near the global minimum) versus descending basis index, further confirming that the independence and isotropy assumptions hold near the global minimum.
  • Figure 5: Comprehensive Analysis of SGD Noise Structure and Approximations (MLP on MNIST, MSE Loss). (a) Unlike the CE-loss case, the exponent $\gamma$ remains at 1 when the model is near the global minimum.
  • ...and 10 more figures
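
The diagnostics named in the captions of Figures 1 and 4 (commutativity error, eigen alignment, Spearman rank correlation, and the randomly rotated baseline) can be sketched as follows. This is an illustrative sketch on synthetic matrices, not the authors' code: the exact normalizations used in the paper are not given in this summary, so the definitions below (commutator norm divided by $\|\mathbf{A}\mathbf{B}\|_F$, diagonal-to-total Frobenius norm ratio, $R_{ij} = C_{ij}/\sqrt{C_{ii} C_{jj}}$) are assumptions, and all function names are hypothetical.

```python
import numpy as np

def commutativity_error(A, B):
    """||AB - BA||_F / ||AB||_F (assumed normalization)."""
    AB = A @ B
    return np.linalg.norm(AB - B @ A) / np.linalg.norm(AB)

def in_eigenbasis(C, H):
    """Express C in the eigenbasis of H, eigenvalues in descending order."""
    evals, evecs = np.linalg.eigh(H)
    order = np.argsort(evals)[::-1]
    V = evecs[:, order]
    return V.T @ C @ V, evals[order]

def eigen_alignment(C_rot):
    """Frobenius norm of the diagonal over that of the full matrix (assumed)."""
    return np.linalg.norm(np.diag(C_rot)) / np.linalg.norm(C_rot)

def spearman(x, y):
    """Spearman rank correlation, assuming there are no ties."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

def correlation_matrix(C_rot):
    """Scale-invariant matrix R_ij = C_ij / sqrt(C_ii * C_jj) (assumed form)."""
    scale = np.sqrt(np.diag(C_rot))
    return C_rot / np.outer(scale, scale)

def random_rotation(C, rng):
    """Conjugate C by a random orthogonal matrix; its eigenvalues are preserved."""
    Q, _ = np.linalg.qr(rng.standard_normal(C.shape))
    return Q @ C @ Q.T

# Synthetic stand-ins: H with a random eigenbasis, and C sharing that basis
# with eigenvalues lam**1.5 plus a small positive semidefinite perturbation.
rng = np.random.default_rng(0)
n = 40
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = np.logspace(0, -2, n)
H = V @ np.diag(lam) @ V.T
G = rng.standard_normal((n, n))
C = V @ np.diag(lam ** 1.5) @ V.T + 1e-5 * (G @ G.T) / n
C_rand = random_rotation(C, rng)

results = {}
off_mask = ~np.eye(n, dtype=bool)
for name, M in {"real": C, "rand": C_rand}.items():
    M_rot, h_diag = in_eigenbasis(M, H)
    R = correlation_matrix(M_rot)
    results[name] = {
        "comm": commutativity_error(M, H),
        "align": eigen_alignment(M_rot),
        "rho": spearman(np.diag(M_rot), h_diag),
        "off": np.abs(R[off_mask]).mean(),  # mean |off-diagonal correlation|
    }
    print(name, {k: round(float(v), 4) for k, v in results[name].items()})
```

With these definitions, a covariance that nearly shares the Hessian eigenbasis shows a small commutativity error, an alignment ratio close to 1, and small off-diagonal correlations, while the rotated baseline does not. The baseline values obtained here depend on the chosen synthetic spectra and should not be read as reproducing the numbers reported in the figures.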

Theorems & Definitions (15)

  • Definition 3.1: Minimal Activity-Weight Duality (Feng et al., 2023)
  • Lemma 3.2: Explicit Solution for AWD (Feng et al., 2023)
  • Lemma 3.3: AWD Gradient Approximation
  • proof
  • Theorem 3.4: Spectral Decomposition of SGD Noise
  • Remark 3.5
  • Remark 3.6: Dimensional Consistency
  • Theorem 5.1: Universal Bounds on $\gamma$
  • Remark 5.2: Empirical Robustness Beyond Local Convexity
  • Proposition 5.3: Perfect Alignment, $\gamma \to 2$
  • ...and 5 more