On the Superlinear Relationship between SGD Noise Covariance and Loss Landscape Curvature

Yikuan Zhang; Ning Yang; Yuhai Tu

TL;DR

This work resolves a long-standing tension in understanding SGD noise by arguing that the Fisher-based linear link between noise covariance $\mathbf{C}$ and the Hessian $\mathbf{H}$ is generally invalid in deep learning. Using Activity–Weight Duality (AWD), the authors establish a loss-agnostic relation $\mathbf{C} \propto \mathbb{E}_p[\mathbf{h}_p^2]$, and show that $\mathbf{C}$ and $\mathbf{H}$ approximately commute with diagonal elements obeying a power-law $C_{ii} \propto H_{ii}^{\gamma}$ where $1 \le \gamma \le 2$, determined by per-sample Hessian spectra. Empirically, CE loss yields $\gamma>1$ (superlinear scaling) while MSE is near linear, and AWD captures these exponents with good accuracy; the differences are traced to a correlation between the leading per-sample curvature and its alignment with the global geometry. The results provide a unifying, geometry-aware description of SGD noise, explain why SGD regularizes toward flatter regions, and are supported by extensive experiments across datasets, architectures, and loss functions. The framework is automatic and model-agnostic, offering practical insights into optimization and generalization in deep learning.
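One textbook setting where a Fisher/Hessian identity holds, and where it stops describing the observed gradients, is plain logistic regression. The sketch below is an illustration and is not taken from the paper: the inputs `X`, weights `w`, and the helper `weighted_outer` are hypothetical. It checks that the Hessian equals the gradient second moment when labels are averaged over the model's own predictive distribution, and that the second moment computed with the observed labels (the quantity relevant to SGD noise when the mean gradient is small) is strictly smaller once the model fits those labels.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data and weights (not from the paper).
X = np.array([[1.0, 0.5], [-0.5, 2.0], [2.0, -1.0], [-1.5, -0.5]])
w = np.array([1.0, 0.8])
p = sigmoid(X @ w)                       # predicted probability of class 1

def weighted_outer(weights):
    """Average of weights[n] * x_n x_n^T over the samples."""
    return np.einsum("n,ni,nj->ij", weights, X, X) / len(X)

# Per-sample gradient of the logistic loss is (p - y) x; per-sample Hessian is p (1 - p) x x^T.
H = weighted_outer(p * (1 - p))

# Gradient second moment with labels drawn from the model: E_{y~p}[(p - y)^2] = p (1 - p).
F_model = weighted_outer(p * (1 - p) ** 2 + (1 - p) * p ** 2)

# Gradient second moment with observed labels that the model classifies correctly.
y = (p > 0.5).astype(float)
G_data = weighted_outer((p - y) ** 2)

print("max |F_model - H|:", np.abs(F_model - H).max())
print("trace(H), trace(G_data):", np.trace(H), np.trace(G_data))
```

In this toy case the gap comes entirely from the labels not being distributed as the model predicts; deep networks add further sources of mismatch that this sketch does not attempt to model.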

Abstract

Stochastic Gradient Descent (SGD) introduces anisotropic noise that is correlated with the local curvature of the loss landscape, thereby biasing optimization toward flat minima. Prior work often assumes an equivalence between the Fisher Information Matrix and the Hessian for negative log-likelihood losses, leading to the claim that the SGD noise covariance $\mathbf{C}$ is proportional to the Hessian $\mathbf{H}$. We show that this assumption holds only under restrictive conditions that are typically violated in deep neural networks. Using the recently discovered Activity--Weight Duality, we find a more general relationship agnostic to the specific loss formulation, showing that $\mathbf{C} \propto \mathbb{E}_p[\mathbf{h}_p^2]$, where $\mathbf{h}_p$ denotes the per-sample Hessian with $\mathbf{H} = \mathbb{E}_p[\mathbf{h}_p]$. As a consequence, $\mathbf{C}$ and $\mathbf{H}$ commute approximately rather than coincide exactly, and their diagonal elements follow an approximate power-law relation $C_{ii} \propto H_{ii}^{\gamma}$ with a theoretically bounded exponent $1 \leq \gamma \leq 2$, determined by per-sample Hessian spectra. Experiments across datasets, architectures, and loss functions validate these bounds, providing a unified characterization of the noise-curvature relationship in deep learning.
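To make the relation $\mathbf{C} \propto \mathbb{E}_p[\mathbf{h}_p^2]$ concrete, the following sketch fits the exponent $\gamma$ on two hand-built sets of per-sample Hessians. Both constructions are toy assumptions chosen so the answer is known in closed form; they are not the paper's experiments, and `fit_gamma` is a hypothetical helper. When every sample shares the same Hessian, $\mathbf{C} \propto \mathbf{H}^2$ and the fitted slope is 2. When each sample has a single curvature of equal magnitude along one of several orthogonal directions, $\mathbf{C} \propto \mathbf{H}$ and the slope is 1.

```python
import numpy as np

def fit_gamma(per_sample_hessians):
    """Slope of log C_ii versus log H_ii in the eigenbasis of H."""
    h = np.asarray(per_sample_hessians, dtype=float)   # shape (P, d, d)
    H = h.mean(axis=0)                                  # H = E_p[h_p]
    C = np.einsum("pij,pjk->ik", h, h) / len(h)         # E_p[h_p^2], up to a constant
    H_diag, U = np.linalg.eigh(H)                       # eigenvalues and eigenbasis of H
    C_diag = np.einsum("ji,jk,ki->i", U, C, U)          # diagonal of U^T C U
    slope, _ = np.polyfit(np.log(H_diag), np.log(C_diag), 1)
    return slope

# Case 1: every sample has the same Hessian, so E[h^2] = H^2.
shared = [np.diag([1.0, 2.0, 4.0, 8.0])] * 6
gamma_shared = fit_gamma(shared)

# Case 2: each sample curves along a single basis direction with the same
# magnitude; direction k is hit by counts[k] samples, so E[h^2] = 3 * H.
counts = [1, 2, 4, 8]
disjoint = []
for k, n in enumerate(counts):
    e = np.zeros(len(counts))
    e[k] = 1.0
    disjoint += [3.0 * np.outer(e, e)] * n
gamma_disjoint = fit_gamma(disjoint)

print(gamma_shared, gamma_disjoint)
```

Real per-sample Hessians sit between these two idealized extremes, which is one way to read the interval $1 \leq \gamma \leq 2$ stated above.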

Paper Structure (55 sections, 6 theorems, 99 equations, 15 figures, 3 tables)

Introduction
Background and Motivation
The Flaw in Fisher Approximation
Commutativity between $\mathbf{C}$ and $\mathbf{H}$
C-H Relation via Activity-Weight Duality
Matched Sample Pairs and Perturbations
Minimal Activity-Weight Duality in FCL
AWD-Based Gradient Approximation
Spectral Structure of the Noise Covariance
Empirical Power-Law
Universal Bounds on the Scaling Exponent
Derivation of the Bounds
Physical Interpretation
Distinction between CE and MSE
Conclusion and Discussion
...and 40 more sections

Key Result

Lemma 3.2

The closed-form solution to the optimization problem in Definition 3.1 is a rank-1 outer product.
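The lemma's explicit formula is not reproduced on this page. As a generic illustration of why a minimal-norm weight perturbation can be rank-1, consider the stand-in problem of finding the smallest (Frobenius-norm) matrix `dW` satisfying a single linear constraint `dW @ a = b`. This problem, and the names `a`, `b`, and `dW`, are assumptions made for the sketch and should not be read as the paper's Definition 3.1. Its minimizer is the outer product of `b` and `a` divided by the squared norm of `a`.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(5)   # hypothetical input activity vector
b = rng.standard_normal(3)   # hypothetical required change in the layer output

# Minimum-Frobenius-norm solution of dW @ a = b: a rank-1 outer product.
dW = np.outer(b, a) / (a @ a)

# Any other solution differs from dW by a matrix N with N @ a = 0.
M = rng.standard_normal((3, 5))
N = M - np.outer(M @ a, a) / (a @ a)   # remove the component of each row along a
dW_other = dW + N

constraint_ok = np.allclose(dW @ a, b) and np.allclose(dW_other @ a, b)
rank = np.linalg.matrix_rank(dW)
norm_min = np.linalg.norm(dW)
norm_other = np.linalg.norm(dW_other)

print(constraint_ok, rank, norm_min, norm_other)
```

Because `dW` is orthogonal to every such `N` in the Frobenius inner product, adding `N` can only increase the norm, which is why the rank-1 solution is minimal in this stand-in problem.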

Figures (15)

Figure 1: Noise–curvature alignment for a CNN trained on CIFAR-10 with cross-entropy loss (top 100 eigen-directions). (a) The empirical covariance matrix $\mathbf{C}$ represented in the Hessian eigenbasis. (b) The scale-invariant correlation matrix $\mathbf{R}_{\text{real}}$, normalized by diagonal elements. (c) The randomized baseline $\mathbf{R}_{\text{rand}}$, constructed by randomly rotating $\mathbf{C}$ while preserving its eigenvalue spectrum. See Appendix Figures \ref{fig:cmt_cifar_cnn_cse}, \ref{fig:cmt_mnist_fc_cse}, and \ref{fig:cmt_mnist_fc_mse} for more details and results on additional architectures.
Figure 2: Log-log plot of diagonal elements using the top $1000$ eigenvalues for models trained to convergence ($100\%$ training accuracy for CE and $>95\%$ for MSE). Data points are mean-centered and vertically shifted for visualization; solid lines denote linear fits. (a) Empirical noise covariance versus the Hessian. (b) AWD-derived noise covariance (Eq. \ref{eq:thm_result}) versus the Hessian. The dotted and dashed lines correspond to slopes $1$ and $2$, respectively.
Figure 3: Log-log plots of the diagonal elements of the covariance obtained in the "suppression experiment" versus those of the original covariance. (a, b) Covariance derived from per-sample Hessians retaining only the dominant eigenvalues. (c, d) Covariance derived after replacing the dominant eigenvalues with their mean value. Columns correspond to distinct models: (a, c) MLP on CIFAR-10 (CE loss, $\gamma \approx 1.4$) and (b, d) MLP on MNIST (MSE loss, $\gamma \approx 1$).
Figure 4: Comprehensive Analysis of SGD Noise Structure and Approximations (MLP on MNIST, CE Loss). (a) Evolution of the scaling exponent $\gamma$. The exponent remains within the interval $[1, 2]$ throughout training; it gradually increases as training progresses and moves away from the lower bound of 1 near the global minimum. Notably, in the terminal phase (near the global minimum, indicated by the vertical dashed line), the scaling exponent derived from the raw empirical covariance ($\mathbf{Covar}$) converges to match those of the AWD-derived covariance ($\mathbf{C}_{AWD,raw}$) and its full or partial approximations ($\mathbf{C}^{hh} , \mathbf{C}^{hh,SD} , \mathbf{C}^{hh,SD,WD} , \mathbf{C}^{hh,SD,WD,LI}$; see Appendix \ref{app:awd_approx} for details), showing that the approximations used in Theorem \ref{thm:spectral_noise} are valid. (b) Commutativity error between matrix pairs. The random baseline (dash-dot line) represents the expected error ($\approx 1.4$) for unrelated matrices. The significantly lower error for the covariances and the Hessian indicates that they satisfy an approximate commutation relation. (c) Eigen alignment, measured as the ratio of the diagonal magnitude to the total magnitude of $\mathbf{C}$ in the eigenbasis of $\mathbf{H}$. High ratios indicate that $\mathbf{C}$ is nearly diagonal in $\mathbf{H}$'s basis, further supporting approximate commutativity. (d) Spearman rank correlation between the diagonals of $\mathbf{C}$ and $\mathbf{H}$ in $\mathbf{H}$'s eigenbasis. A value of 1.0 indicates a strict monotonic correspondence between the noise and curvature spectra. (e) Training dynamics showing the loss and accuracy; the model converges to 100% training accuracy around epoch 30. (f) Evolution of Frobenius norms validating the gradient noise approximation; see Appendix \ref{app:awd_approx} for the definitions of these variables. The dominance of the Hessian-weight term ($\mathbf{C}^{hh}$) over the gradient-activity terms ($\mathbf{C}^{hg}, \mathbf{C}^{gg}$) confirms the "Vanishing Gradients" assumption near the global minimum. The convergence of the terms ($\mathbf{C}_{AWD,raw}, \mathbf{C}^{hh} , \mathbf{C}^{hh,SD} , \mathbf{C}^{hh,SD,WD}$) validates the Independence of Distinct Samples and Local Isotropy assumptions in Theorem \ref{thm:spectral_noise}. (g) Diagonal magnitudes of $\mathbf{C}^{hh}$, $\mathbf{C}^{hg}$, and $\mathbf{C}^{gg}$ compared to $\mathbf{C}_{AWD,raw}$ at epoch 100 (near the global minimum) versus descending basis index, providing a detailed view of the dominance of the Hessian term ($\mathbf{C}^{hh}$). (h) Diagonal magnitudes of $\mathbf{C}^{hh}$ and $\mathbf{C}^{hh,SD,WD}$ compared to $\mathbf{C}^{hh,SD}$ at epoch 100 (near the global minimum) versus descending basis index, further confirming that the independence and isotropy assumptions hold near the global minimum.
Figure 5: Comprehensive Analysis of SGD Noise Structure and Approximations (MLP on MNIST, MSE Loss). (a) Unlike the CE case, the exponent $\gamma$ remains at 1 when the model is near the global minimum.
...and 10 more figures
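The diagnostics named in the captions above (correlation matrix normalized by its diagonal, random-rotation baseline, commutativity error, eigen alignment, Spearman rank correlation) can be sketched as follows. The exact normalizations used in the paper are not given on this page, so the formulas in `commutativity_error` and `eigen_alignment` are assumptions, and all function names are hypothetical. The ratio $\|AB - BA\|_F / \|AB\|_F$ is a guess; one motivation is that if $AB$ and $BA$ were nearly orthogonal in the Frobenius inner product, the ratio would be $\sqrt{2} \approx 1.4$, the baseline value quoted in Figure 4(b).

```python
import numpy as np

def correlation_matrix(C):
    """Scale-invariant version of C: C_ij / sqrt(C_ii * C_jj)."""
    d = np.sqrt(np.diag(C))
    return C / np.outer(d, d)

def random_rotation(C, rng):
    """Conjugate C by a random orthogonal matrix; the eigenvalues are preserved.
    (QR of a Gaussian matrix is orthogonal, though not exactly Haar-distributed.)"""
    Q, _ = np.linalg.qr(rng.standard_normal(C.shape))
    return Q @ C @ Q.T

def commutativity_error(A, B):
    """Assumed definition: ||AB - BA||_F / ||AB||_F."""
    AB = A @ B
    return np.linalg.norm(AB - B @ A) / np.linalg.norm(AB)

def eigen_alignment(C, H):
    """Assumed definition: ||diag(C')||_2 / ||C'||_F with C' = U^T C U, U from H."""
    _, U = np.linalg.eigh(H)
    C_rot = U.T @ C @ U
    return np.linalg.norm(np.diag(C_rot)) / np.linalg.norm(C_rot)

def spearman(x, y):
    """Spearman rank correlation as the Pearson correlation of ranks (no tie handling)."""
    rx, ry = np.argsort(np.argsort(x)), np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

# Synthetic check: C shares H's eigenvectors with C_ii = H_ii^1.5 (toy assumption).
rng = np.random.default_rng(0)
dim = 50
U, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
lam = np.logspace(0, -3, dim)
H = U @ np.diag(lam) @ U.T
C = U @ np.diag(lam ** 1.5) @ U.T
C_rand = random_rotation(C, rng)

err_real, err_rand = commutativity_error(C, H), commutativity_error(C_rand, H)
align_real, align_rand = eigen_alignment(C, H), eigen_alignment(C_rand, H)
eigvals, V = np.linalg.eigh(H)
rho = spearman(np.diag(V.T @ C @ V), eigvals)
R_real = correlation_matrix(V.T @ C @ V)

print(err_real, err_rand, align_real, align_rand, rho)
```

On this synthetic pair the commuting matrices give a near-zero commutativity error and an alignment near 1, while the randomly rotated copy does not, which mirrors the qualitative contrast the captions describe between the real and randomized cases.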

Theorems & Definitions (15)

Definition 3.1: Minimal Activity-Weight Duality \cite{fengActivityWeightDuality2023}
Lemma 3.2: Explicit Solution for AWD \cite{fengActivityWeightDuality2023}
Lemma 3.3: AWD Gradient Approximation
Theorem 3.4: Spectral Decomposition of SGD Noise
Remark 3.5
Remark 3.6: Dimensional Consistency
Theorem 5.1: Universal Bounds on $\gamma$
Remark 5.2: Empirical Robustness Beyond Local Convexity
Proposition 5.3: Perfect Alignment, $\gamma \to 2$
...and 5 more
