A Precise Characterization of SGD Stability Using Loss Surface Geometry

Gregory Dexter, Borja Ocejo, Sathiya Keerthi, Aman Gupta, Ayan Acharya, Rajiv Khanna

TL;DR

This work provides a precise characterization of SGD stability near optima by linking linearized dynamics to loss surface geometry. It introduces a Hessian coherence measure $\sigma$ and derives a simple, interpretable divergence condition that depends on the spectrum of the per-example Hessians, the learning rate $\eta$, and the batch size $B$. The results extend beyond mean-squared error to general additively decomposable losses and establish near-optimality of the stability criterion under natural parameter regimes. The authors validate the theory with synthetic experiments that illustrate how Hessian alignment shapes SGD stability and offer guidance on how $\eta$ and $B$ interact with loss geometry. These insights deepen our understanding of implicit regularization via loss surface geometry and may inform hyperparameter strategies in large-scale, overparameterized models.

Abstract

Stochastic Gradient Descent (SGD) stands as a cornerstone optimization algorithm with proven real-world empirical successes but relatively limited theoretical understanding. Recent research has illuminated a key factor contributing to its practical efficacy: the implicit regularization it instigates. Several studies have investigated the linear stability property of SGD in the vicinity of a stationary point as a predictive proxy for sharpness and generalization error in overparameterized neural networks (Wu et al., 2022; Jastrzebski et al., 2019; Cohen et al., 2021). In this paper, we delve deeper into the relationship between linear stability and sharpness. More specifically, we meticulously delineate the necessary and sufficient conditions for linear stability, contingent on hyperparameters of SGD and the sharpness at the optimum. Towards this end, we introduce a novel coherence measure of the loss Hessian that encapsulates pertinent geometric properties of the loss function that are relevant to the linear stability of SGD. It enables us to provide a simplified sufficient condition for identifying linear instability at an optimum. Notably, compared to previous works, our analysis relies on significantly milder assumptions and is applicable for a broader class of loss functions than known before, encompassing not only mean-squared error but also cross-entropy loss.
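
The notion of linear stability discussed above can be illustrated with a small simulation. The following is a minimal sketch, not the paper's experiment: it assumes rank-one per-example Hessians $\mathbf{H}_i = x_i x_i^\top$ (as arise for mean-squared error with a linear model), mini-batches sampled without replacement, and an optimum at the origin. The function name `linearized_sgd_norm` and the two toy datasets are hypothetical and chosen only for illustration.

```python
import random


def linearized_sgd_norm(xs, eta, batch_size, steps, seed=0):
    """Iterate w <- w - (eta / B) * sum_{i in batch} H_i w with H_i = x_i x_i^T.

    Returns the Euclidean norm of w after `steps` updates.
    """
    rng = random.Random(seed)
    n, d = len(xs), len(xs[0])
    w = [1.0] * d  # displacement from the optimum, which sits at w = 0
    for _ in range(steps):
        # Assumption: batches are drawn without replacement.
        batch = rng.sample(range(n), batch_size)
        grad = [0.0] * d
        for i in batch:
            proj = sum(xj * wj for xj, wj in zip(xs[i], w))  # x_i . w
            for j in range(d):
                grad[j] += xs[i][j] * proj  # accumulates H_i w = x_i (x_i . w)
        w = [wj - eta / batch_size * gj for wj, gj in zip(w, grad)]
    return sum(wj * wj for wj in w) ** 0.5


n = 4
# Both toy datasets give a full-batch Hessian (1/n) * sum_i H_i with top eigenvalue 1.
aligned = [[1.0] for _ in range(n)]  # all H_i identical (scalar 1)
orthogonal = [[(n ** 0.5 if j == i else 0.0) for j in range(n)]
              for i in range(n)]  # H_i = n * e_i e_i^T, mutually orthogonal

eta = 1.5  # below the full-batch gradient descent threshold 2 / lambda_1 = 2
for name, xs in [("aligned", aligned), ("orthogonal", orthogonal)]:
    for B in (1, n):
        norm = linearized_sgd_norm(xs, eta, B, steps=200)
        print(name, "B =", B, "norm =", norm)
```

In this toy setting the full-batch sharpness is the same for both datasets, yet only the orthogonal case with $B = 1$ grows without bound: each update multiplies the sampled coordinate by $1 - 4\eta = -5$. The aligned case contracts by the factor $1 - \eta = -0.5$ regardless of batch size. This is meant only to show why a stability condition needs more than the top eigenvalue of the averaged Hessian; it does not compute the paper's coherence measure $\sigma$.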

Paper Structure

This paper contains 18 sections, 6 theorems, 35 equations, and 2 figures.

Key Result

Theorem 1

Let $\{\hat{\mathbf{J}}_i\}_{i\in\mathbb{N}}$ be a sequence of i.i.d. copies of $\hat{\mathbf{J}}$ defined in Definition \ref{def:sgd_dynamics}. Let $\{\mathbf{H}_i\}_{i \in [n]}$ have coherence measure $\sigma$. If,

Figures (2)

  • Figure 1: The red area indicates where SGD diverges and the blue area where it does not diverge, over parameter pairs $(\sigma, B)$. The solid black line is where the condition of Theorem \ref{thm:divergence_simple} attains equality, and the dashed line is where the condition of Theorem \ref{thm:simplified_optimality} attains equality.
  • Figure 2: The red area indicates where SGD diverges and the grey area where it does not diverge, over parameter pairs $(\eta, B)$. We plot $\eta^2$ rather than $\eta$ to make the linear relation between $B$ and $\eta^2$ clearer. The solid black line is where the condition of Theorem \ref{thm:divergence_simple} attains equality, and the dashed line is where the condition of Theorem \ref{thm:simplified_optimality} attains equality.

Theorems & Definitions (10)

  • Definition 1
  • Definition 2
  • Theorem 1
  • Theorem 2
  • Lemma 4.1
  • Definition 3
  • Lemma A.1
  • Lemma A.2
  • Definition 4
  • Theorem 3