On the Stability of Nonlinear Dynamics in GD and SGD: Beyond Quadratic Potentials

Rotem Mulayoff, Sebastian U. Stich

TL;DR

This work shows that nonlinear terms can change which minima are stable under gradient methods. It derives an exact multivariate criterion for stable oscillations of gradient descent near a minimizer, expressed via high-order derivatives of the loss at the minimum, and identifies when stable period-2 dynamics occur at the edge of stability. For SGD, it shows that a single unstable batch can drive divergence in expectation, contradicting predictions based on mean linear stability, and it provides a Koopman-based framework yielding a sufficient condition: if all batches are linearly stable, nonlinear SGD remains stable in a neighborhood of the interpolating minimum. Together, the results refine the picture of edge-of-stability phenomena and inform the choice of learning rate and batch strategy beyond quadratic or linearized analyses.

Abstract

The dynamical stability of the iterates during training plays a key role in determining the minima obtained by optimization algorithms. For example, stable solutions of gradient descent (GD) correspond to flat minima, which have been associated with favorable features. While prior work often relies on linearization to determine stability, it remains unclear whether linearized dynamics faithfully capture the full nonlinear behavior. Recent work has shown that GD may stably oscillate near a linearly unstable minimum and still converge once the step size decays, indicating that linear analysis can be misleading. In this work, we explicitly study the effect of nonlinear terms. Specifically, we derive an exact criterion for stable oscillations of GD near minima in the multivariate setting. Our condition depends on high-order derivatives, generalizing existing results. Extending the analysis to stochastic gradient descent (SGD), we show that nonlinear dynamics can diverge in expectation even if only a single batch is unstable. This implies that stability can be dictated by a single batch that oscillates unstably, rather than by an average effect, as linear analysis suggests. Finally, we prove that if all batches are linearly stable, the nonlinear dynamics of SGD are stable in expectation.

Paper Structure

This paper contains 27 sections, 8 theorems, 137 equations, 2 figures, and 1 table.

Key Result

proposition 1

Let $\{ x_t \}$ be the iterates of SGD on $f_+$ and $f_a$ from Eq. \ref{eq:counter example}, with $x_0 \neq 0$. If $\eta > 2$, then $\mathbb{E} [ |x_t-x^*| ] \underset{t \to \infty}{\longrightarrow} \infty$.

Figures (2)

  • Figure 1: Stable vs. unstable oscillations near a minimum. We apply GD to $f_{+}$ and $f_{-}$ from Eq. \ref{eq:normal forms of flip bifurcation} with various step sizes $\eta \in (1,4)$. The resulting dynamics, Eq. \ref{eq:normal form in GD dynamics}, correspond to the normal form of a flip bifurcation. Once the step size exceeds the linear stability threshold $\eta_{\mathrm{lin}} = 2$, stability is determined by the sign of the cubic term in the dynamics. Panel \ref{fig:Warmup_functions} shows $f_{+}$ and $f_{-}$, whose minima share the same sharpness. Panel \ref{fig:Warmup_bifurcation1} visualizes GD's output on $f_{+}$ with various step sizes. When the step size $\eta$ crosses $\eta_{\mathrm{lin}}$, the minimum $x^* = 0$ loses stability, and the resulting unstable oscillations lead to divergence. Panel \ref{fig:Warmup_bifurcation2} depicts the points to which GD converges on $f_{-}$ for various step sizes. At the threshold $\eta = \eta_{\mathrm{lin}}$, the minimizer $x^*=0$ loses stability, and the iterates settle into a stable period-2 cycle, which then undergoes period doubling, chaos, and eventually divergence for $\eta > 4$.
  • Figure 2: Demonstration of Thm. \ref{Thm:Stable oscillations of GD}. Consider $\mathcal{L}_{\beta} ( x_1, x_2) = \frac{1}{2}x_1^2 +\frac{1}{10}x_2^2 + \beta x_1^2 x_2 + \frac{1}{10} x_1^4$, whose linear stability threshold under GD in the vicinity of the local minimizer $\boldsymbol{x}^* = (0,0)$ is $\eta_{\mathrm{lin}} = 2$. According to Thm. \ref{Thm:Stable oscillations of GD}, GD at the edge of stability oscillates stably around $\boldsymbol{x}^*$ if and only if $|\beta| > 0.2$ (see App. \ref{app:GD analytic example}). Panel \ref{fig:GD_func_graphs} plots $\mathcal{L}_{\beta}$ near $\boldsymbol{x}^*$ for $\beta = 0.1$ and $\beta = 0.5$, highlighting the asymmetry introduced by the cubic term. Panel \ref{fig:GD Oscillations} shows the long-term value $\boldsymbol{x}_T$ across a range of $\beta$. When $|\beta| > 0.2$, GD converges to a stable period-2 cycle, whereas for $|\beta| < 0.2$ the iterates diverge. This confirms that condition \ref{eq:condition for stable oscillations} precisely captures the transition from stability to instability.
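
The behavior described in the Figure 2 caption can be checked numerically with a minimal sketch. It runs plain GD on $\mathcal{L}_{\beta}$ as written in the caption, for one value of $\beta$ on each side of $0.2$. The step size $\eta = 2.05$, the starting point, the iteration count, and the blow-up threshold are assumptions chosen for illustration and are not taken from the paper. The helper `predicted_amplitude` is a hypothetical name; its formula comes from solving the GD update directly for a symmetric 2-cycle ($x_1 \to -x_1$, $x_2$ fixed), not from the paper's theorem.

```python
import math


def grad(x1, x2, beta):
    """Gradient of L_beta(x1, x2) = x1^2/2 + x2^2/10 + beta*x1^2*x2 + x1^4/10."""
    g1 = x1 + 2.0 * beta * x1 * x2 + 0.4 * x1 * x1 * x1
    g2 = 0.2 * x2 + beta * x1 * x1
    return g1, g2


def run_gd(beta, eta=2.05, x0=(0.05, 0.0), steps=5000, blowup=1e3):
    """Run GD from x0. Returns (diverged, previous iterate, last iterate).

    eta, x0, steps and blowup are illustrative assumptions, not paper values.
    """
    x1, x2 = x0
    prev = (x1, x2)
    for _ in range(steps):
        g1, g2 = grad(x1, x2, beta)
        prev = (x1, x2)
        x1, x2 = x1 - eta * g1, x2 - eta * g2
        # Stop early once the iterate has clearly left the neighborhood.
        if max(abs(x1), abs(x2)) > blowup:
            return True, prev, (x1, x2)
    return False, prev, (x1, x2)


def predicted_amplitude(beta, eta):
    """|x1| on a symmetric 2-cycle (x1 -> -x1, x2 fixed).

    Derived here from the update rule: x2 = -5*beta*x1^2 and
    1 + (0.4 - 10*beta^2) * x1^2 = 2/eta. Real only if |beta| > 0.2 and eta > 2.
    """
    return math.sqrt((1.0 - 2.0 / eta) / (10.0 * beta ** 2 - 0.4))


if __name__ == "__main__":
    for beta in (0.1, 0.5):
        diverged, prev, last = run_gd(beta)
        if diverged:
            print(f"beta={beta}: iterates left the neighborhood (divergence)")
        else:
            print(f"beta={beta}: bounded, last two x1 values {prev[0]:.4f}, {last[0]:.4f}")
```

With these assumed settings, the $\beta = 0.5$ run settles into alternating values of $x_1$ with magnitude about $0.108$, matching `predicted_amplitude(0.5, 2.05)`, while the $\beta = 0.1$ run exceeds the blow-up threshold.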

Theorems & Definitions (13)

  • proposition 1: Worst case batch
  • theorem 1: Stable oscillations in GD
  • corollary 1: Sufficient condition for stable oscillations
  • definition 1: Interpolating minimizer
  • theorem 2: Necessary condition
  • theorem 3: Sufficient condition
  • definition 2: Linearized dynamics
  • theorem 4: Univariate linear stability threshold [wu2018sgd]
  • definition 3: Operator norm
  • definition 4: Spectrum
  • ...and 3 more