A Precise Characterization of SGD Stability Using Loss Surface Geometry
Gregory Dexter, Borja Ocejo, Sathiya Keerthi, Aman Gupta, Ayan Acharya, Rajiv Khanna
TL;DR
This work provides a precise characterization of SGD stability near optima by linking linearized dynamics to loss surface geometry. It introduces a Hessian coherence measure $\sigma$ and derives a simple, interpretable divergence condition that depends on the spectrum of the per-example Hessians, the learning rate $\eta$, and the batch size $B$. The results extend beyond mean-squared error to general additively decomposable losses and establish near-optimality of the stability criterion under natural parameter regimes. The authors validate the theory with synthetic experiments that illustrate how Hessian alignment shapes SGD stability and offer guidance on how $\eta$ and $B$ interact with loss geometry. These insights deepen our understanding of implicit regularization via loss surface geometry and may inform hyperparameter strategies in large-scale, overparameterized models.
Abstract
Stochastic Gradient Descent (SGD) stands as a cornerstone optimization algorithm with proven real-world empirical successes but relatively limited theoretical understanding. Recent research has illuminated a key factor contributing to its practical efficacy: the implicit regularization it induces. Several studies have investigated the linear stability property of SGD in the vicinity of a stationary point as a predictive proxy for sharpness and generalization error in overparameterized neural networks (Wu et al., 2022; Jastrzebski et al., 2019; Cohen et al., 2021). In this paper, we delve deeper into the relationship between linear stability and sharpness. More specifically, we delineate the necessary and sufficient conditions for linear stability, contingent on the hyperparameters of SGD and the sharpness at the optimum. To this end, we introduce a novel coherence measure of the loss Hessian that captures the geometric properties of the loss function relevant to the linear stability of SGD. It enables us to provide a simplified sufficient condition for identifying linear instability at an optimum. Notably, compared to previous works, our analysis relies on significantly milder assumptions and applies to a broader class of loss functions than previously known, encompassing not only mean-squared error but also cross-entropy loss.
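To make the notion of linear stability concrete, the sketch below simulates linearized SGD near an optimum placed at the origin, using the update $w_{t+1} = w_t - \eta \cdot \frac{1}{B}\sum_{i \in S_t} H_i w_t$ with per-example Hessians $H_i$. This is an illustrative toy under stated assumptions, not the paper's experiment and not its coherence measure $\sigma$: the function name `simulate_linearized_sgd`, the uniform without-replacement batch sampling, and the two synthetic Hessian families are all hypothetical choices. Both families have the same average Hessian (the identity), so they share the same sharpness and differ only in how the per-example Hessians align with one another.

```python
import numpy as np


def simulate_linearized_sgd(hessians, lr, batch_size, steps=200, seed=0):
    """Iterate w <- w - lr * (mean of H_i over a random batch) @ w.

    hessians: array of shape (n, d, d) holding per-example Hessians
    (assumed symmetric PSD). Returns the norm of w at every step,
    starting from a random unit vector.
    """
    rng = np.random.default_rng(seed)
    n, d, _ = hessians.shape
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    norms = [1.0]
    for _ in range(steps):
        # Assumption: batches are drawn uniformly without replacement.
        batch = rng.choice(n, size=batch_size, replace=False)
        h_batch = hessians[batch].mean(axis=0)
        w = w - lr * (h_batch @ w)
        norms.append(float(np.linalg.norm(w)))
    return np.array(norms)


n = 10
lr = 0.5

# Family 1: every example has the same Hessian (the identity).
aligned = np.stack([np.eye(n)] * n)

# Family 2: each example curves along its own coordinate axis only,
# scaled so that the average Hessian is again the identity.
orthogonal = np.stack([n * np.outer(e, e) for e in np.eye(n)])

aligned_norms = simulate_linearized_sgd(aligned, lr, batch_size=1)
orthogonal_norms = simulate_linearized_sgd(orthogonal, lr, batch_size=1)
full_batch_norms = simulate_linearized_sgd(orthogonal, lr, batch_size=n)

print("aligned, B=1:     final norm", aligned_norms[-1])
print("orthogonal, B=1:  final norm", orthogonal_norms[-1])
print("orthogonal, B=n:  final norm", full_batch_norms[-1])
```

In this toy setup, the first family contracts at batch size 1, while the second family diverges at batch size 1 and contracts again once the batch covers all examples. The sharpness of the average Hessian alone therefore does not determine stability here; the learning rate, the batch size, and the alignment of the per-example Hessians all matter.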
