Does SGD Seek Flatness or Sharpness? An Exactly Solvable Model

Yizhou Xu, Pierfrancesco Beneventano, Isaac Chuang, Liu Ziyin

TL;DR

The paper addresses conflicting views on whether SGD seeks flat or sharp minima by analyzing an exactly solvable deep linear network model under a minimal-fluctuation constraint. It shows that SGD effectively minimizes gradient fluctuations through an entropic loss, and that the sharpness at convergence is dictated by data geometry, specifically the label-noise covariance $\Sigma_\epsilon$ and the input covariance $\Sigma_x$: isotropic noise yields the flattest minima, while anisotropic noise causes sharpening proportional to the condition number $\kappa(\Sigma_\epsilon)$. A closed-form expression for the sharpness at global minima is derived, showing that SGD converges to a unique sharpness value regardless of initialization. The work also demonstrates a clear separation between sharpness and gradient fluctuation, provides a data-geometric explanation for progressive sharpening, and validates the theory on nonlinear architectures (MLP, RNN, Transformer). These findings challenge the conventional view that SGD universally favors flat minima and highlight the role of label-noise anisotropy in generalization and optimization behavior.
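To make the notion of label-noise anisotropy concrete, here is a minimal sketch that builds a label-noise covariance $\Sigma_\epsilon$ with a prescribed condition number $\kappa(\Sigma_\epsilon)$ and samples noise from it. The function names, the diagonal form of the covariance, and the geometric eigenvalue spacing are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def label_noise_covariance(d_out, kappa):
    """Hypothetical helper: diagonal covariance whose eigenvalues run
    geometrically from 1 to kappa, so its condition number equals kappa."""
    eigenvalues = np.geomspace(1.0, kappa, d_out)
    return np.diag(eigenvalues)

def sample_label_noise(cov, n, rng):
    """Draw n zero-mean Gaussian noise vectors with covariance cov."""
    return rng.multivariate_normal(np.zeros(cov.shape[0]), cov, size=n)

rng = np.random.default_rng(0)

iso = label_noise_covariance(4, 1.0)      # isotropic: all eigenvalues equal
aniso = label_noise_covariance(4, 100.0)  # anisotropic: eigenvalues span 1..100

kappa_iso = np.linalg.cond(iso)      # 2-norm condition number
kappa_aniso = np.linalg.cond(aniso)

noise = sample_label_noise(aniso, 1000, rng)  # shape (1000, 4)
print(kappa_iso, kappa_aniso, noise.shape)
```

In this sketch the isotropic case corresponds to $\kappa(\Sigma_\epsilon)=1$, and larger values of `kappa` correspond to the more imbalanced noise spectra that the summary associates with sharper solutions.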

Abstract

A large body of theory and empirical work hypothesizes a connection between the flatness of a neural network's loss landscape during training and its performance. However, there have been conceptually opposite pieces of evidence regarding when SGD prefers flatter or sharper solutions during training. In this work, we partially but causally clarify the flatness-seeking behavior of SGD by identifying and exactly solving an analytically solvable model that exhibits both flattening and sharpening behavior during training. In this model, SGD training has no a priori preference for flatness, but only a preference for minimal gradient fluctuations. This leads to the insight that, at least within this model, it is the data distribution that uniquely determines the sharpness at convergence, and that a flat minimum is preferred if and only if the noise in the labels is isotropic across all output dimensions. When the noise in the labels is anisotropic, the model instead prefers sharpness and can converge to an arbitrarily sharp solution, depending on the imbalance in the spectrum of the label noise. We reproduce this key insight in controlled settings with different model architectures such as MLPs, RNNs, and transformers.

Paper Structure

This paper contains 39 sections, 13 theorems, 106 equations, and 8 figures.

Key Result

Lemma 1

Assume that $A$ is a symmetric matrix and that for any $x,y,\theta$, $\ell(x,y,e^{\lambda A}\theta)=\ell(x,y,\theta)$. Moreover, assume that $A\mathbb{E}_{x,y}\nabla^2\ell(x,y,\theta)\neq0$. Then, $\limsup_{|\lambda|\to+\infty}|T(e^{\lambda A}\theta)|=+\infty$.
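
As a small illustration of the lemma, consider a scalar two-layer linear model $f(x)=uwx$ with squared loss, which is invariant under the rescaling $(u,w)\mapsto(e^{\lambda}u,\,e^{-\lambda}w)$, i.e. $\theta\mapsto e^{\lambda A}\theta$ with $A=\mathrm{diag}(1,-1)$. The sketch below assumes that $T$ denotes the trace of the Hessian (the sharpness measure named in the figure captions); that reading, the toy model, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def loss(x, y, theta):
    """Squared loss of the scalar two-layer linear model f(x) = u * w * x."""
    u, w = theta
    return 0.5 * (y - u * w * x) ** 2

def hessian_trace(x, y, theta):
    """Trace of the Hessian of the loss with respect to (u, w).
    d^2 l / du^2 = (w x)^2 and d^2 l / dw^2 = (u x)^2."""
    u, w = theta
    return x**2 * (u**2 + w**2)

def rescale(theta, lam):
    """Apply e^{lam * A} with A = diag(1, -1): u -> e^lam u, w -> e^-lam w."""
    u, w = theta
    return np.array([np.exp(lam) * u, np.exp(-lam) * w])

x, y = 1.5, 2.0
theta = np.array([0.8, -1.2])
lams = [0.0, 1.0, 2.0, 4.0]

# The loss is unchanged along the symmetry orbit, but the trace is not.
losses = [loss(x, y, rescale(theta, lam)) for lam in lams]
traces = [hessian_trace(x, y, rescale(theta, lam)) for lam in lams]

for lam, l, t in zip(lams, losses, traces):
    print(f"lambda={lam:.1f}  loss={l:.6f}  hessian_trace={t:.3f}")
```

Along the orbit the product $uw$ is fixed, so the loss stays constant, while the trace $x^2(e^{2\lambda}u^2+e^{-2\lambda}w^2)$ grows without bound as $|\lambda|\to\infty$. Under the stated assumption about $T$, this is the kind of divergence the lemma describes.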

Figures (8)

  • Figure 1: We reproduce the experiment from ziyin2024parameter, where a deep matrix factorization problem trained with SGD converges to the same sharpness even though the initial sharpness depends strongly on the initialization scale. The theoretical line is the prediction of Theorem \ref{theo:main}.
  • Figure 2: Non-isotropic noise in the labels leads to progressive sharpening. Left: linear networks, with the minimal sharpness predicted by \ref{eq:minT}. Middle: Two-layer ReLU networks trained in a teacher-student setting. Right: Two-layer ReLU networks trained on MNIST. Each line is averaged over five trials and shown with the standard error. See Appendix \ref{app:exp} for experiment details and more experiments.
  • Figure 3: Across all four models, and for both regression and classification tasks, we observe that the learning process yields a sharper solution as the noise in the labels becomes increasingly non-isotropic. We define relative sharpness as the ratio of the Hessian trace under non-isotropic noise to that under isotropic noise. Left: MSE loss. Right: Cross-entropy loss.
  • Figure 4: Progressive sharpening with non-isotropic noise in the labels is unique to SGD. The red dotted line denotes the time when we switch to GD.
  • Figure 5: The maximal eigenvalue of the Hessian evolves similarly to the trace. Left: linear networks, with the minimal sharpness predicted by \ref{eq:max_eig1}. Middle: Two-layer ReLU networks trained in a teacher-student setting. Right: Two-layer ReLU networks trained on MNIST. Each line is averaged over five trials and shown with the standard error.
  • ...and 3 more figures

Theorems & Definitions (26)

  • Definition 1
  • Lemma 1
  • Lemma 2: Minimal Fluctuation Lemma, informal
  • Theorem 1
  • Corollary 1
  • Corollary 2
  • Corollary 3
  • Theorem 2
  • proof
  • proof
  • ...and 16 more