Does SGD Seek Flatness or Sharpness? An Exactly Solvable Model
Yizhou Xu, Pierfrancesco Beneventano, Isaac Chuang, Liu Ziyin
TL;DR
The paper addresses conflicting views on whether SGD seeks flat or sharp minima by exactly solving a deep linear network model under a minimal-fluctuation constraint. It shows that SGD effectively minimizes gradient fluctuations through an entropic loss, and that the sharpness at convergence is dictated by data geometry, specifically the label-noise covariance $\Sigma_\epsilon$ and the input covariance $\Sigma_x$. Isotropic label noise yields the flattest minima, while anisotropic label noise causes sharpening proportional to the condition number $\kappa(\Sigma_\epsilon)$. A closed-form expression for the sharpness at global minima reveals a unique, SGD-determined sharpness value that does not depend on initialization. The work also demonstrates a clear separation between sharpness and gradient fluctuation, provides a data-geometric explanation for progressive sharpening, and validates the theory on nonlinear architectures (MLP, RNN, Transformer). These findings challenge the conventional view that SGD universally favors flat minima and highlight the pivotal role of label-noise anisotropy in practical generalization and optimization behavior.
Abstract
A large body of theoretical and empirical work hypothesizes a connection between the flatness of a neural network's loss landscape during training and its performance. However, there is conceptually conflicting evidence regarding when SGD prefers flatter or sharper solutions during training. In this work, we partially but causally clarify the flatness-seeking behavior of SGD by identifying and exactly solving a model that exhibits both flattening and sharpening behavior during training. In this model, SGD training has no a priori preference for flatness, but only a preference for minimal gradient fluctuations. This leads to the insight that, at least within this model, it is the data distribution that uniquely determines the sharpness at convergence, and that a flat minimum is preferred if and only if the label noise is isotropic across all output dimensions. When the label noise is anisotropic, the model instead prefers sharpness and can converge to an arbitrarily sharp solution, depending on the imbalance in the label-noise spectrum. We reproduce this key insight in controlled settings with different model architectures, such as MLPs, RNNs, and Transformers.
