Type-II Saddles and Probabilistic Stability of Stochastic Gradient Descent

Liu Ziyin, Botao Li, Tomer Galanti, Masahito Ueda

TL;DR

The paper addresses SGD dynamics near saddle points in neural networks by distinguishing Type-II saddles, which exhibit vanishing gradient noise and are hard to escape, from Type-I saddles. It introduces probabilistic stability and Lyapunov exponents to analyze attractivity around Type-II saddles, leveraging a random matrix product framework and the Furstenberg–Kesten theorem. The key result is that a negative maximal Lyapunov exponent $\Lambda<0$ is equivalent to probabilistic stability, predicting four distinct learning phases governed by the gradient signal-to-noise ratio, with rich implications for initialization and optimization. Empirical studies on lasso-like problems and deep networks (e.g., ResNet-18 on CIFAR-10) corroborate the theory, suggesting that saddle-point dynamics, not just minima, play a central role in solution selection and that SGD can converge to low-rank saddles under certain conditions.
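
As a minimal sketch of the stability criterion in one dimension, assume the linearized update near a saddle is $\theta_{t+1} = (1 - \eta h_t)\theta_t$ with i.i.d. curvature samples $h_t$ and learning rate $\eta$. This scalar model, the Gaussian distribution for $h_t$, the step sizes, and the function names below are illustrative assumptions, not taken from the paper. For such a product of i.i.d. scalars, the Lyapunov exponent reduces to $\Lambda = \mathbb{E}\log|1 - \eta h|$. The sketch compares its sign with the second-moment growth factor $\mathbb{E}(1 - \eta h)^2$ that a mean-square stability criterion would use.

```python
import numpy as np


def scalar_lyapunov(eta, h_samples):
    """Monte Carlo estimate of Lambda = E[log|1 - eta * h|]."""
    return float(np.mean(np.log(np.abs(1.0 - eta * h_samples))))


def second_moment_factor(eta, h_samples):
    """Monte Carlo estimate of E[(1 - eta * h)^2], the per-step variance growth."""
    return float(np.mean((1.0 - eta * h_samples) ** 2))


# Hypothetical curvature distribution: Gaussian with mean 1 and std 2.
rng = np.random.default_rng(0)
h = rng.normal(loc=1.0, scale=2.0, size=200_000)

for eta in (0.1, 0.5, 3.0):  # hypothetical learning rates
    lam = scalar_lyapunov(eta, h)
    m2 = second_moment_factor(eta, h)
    print(f"eta={eta}: Lambda={lam:+.3f}, second-moment factor={m2:.3f}")
```

Under these assumed values, the middle step size is expected to give $\Lambda < 0$ while the second-moment factor exceeds one, so the iterate shrinks in probability even though its variance grows. This is the kind of gap between probabilistic stability and moment-based stability that the summary refers to.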

Abstract

Characterizing and understanding the dynamics of stochastic gradient descent (SGD) around saddle points remains an open problem. We first show that saddle points in neural networks can be divided into two types, among which the Type-II saddles are especially difficult to escape from because the gradient noise vanishes at the saddle. The dynamics of SGD around these saddles are thus to leading order described by a random matrix product process, and it is thus natural to study the dynamics of SGD around these saddles using the notion of probabilistic stability and the related Lyapunov exponent. Theoretically, we link the study of SGD dynamics to well-known concepts in ergodic theory, which we leverage to show that saddle points can be either attractive or repulsive for SGD, and its dynamics can be classified into four different phases, depending on the signal-to-noise ratio in the gradient close to the saddle.
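
The random matrix product picture can be sketched numerically. The example below estimates the maximal Lyapunov exponent of a product $J_T \cdots J_1$ by repeatedly applying a random matrix to a unit vector and averaging the log growth of its norm. The per-step matrix $J_t = I - \eta x_t x_t^T$ with Gaussian $x_t$ is a hypothetical stand-in for the linearized SGD update, and `max_lyapunov` and `make_sgd_jacobian` are illustrative names, not the paper's.

```python
import numpy as np


def max_lyapunov(sample_jacobian, dim, steps=20_000, seed=0):
    """Estimate the maximal Lyapunov exponent of a random matrix product.

    sample_jacobian(rng) must return one (dim, dim) random matrix per call.
    The vector is renormalized every step to avoid overflow or underflow.
    """
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)
    log_growth = 0.0
    for _ in range(steps):
        v = sample_jacobian(rng) @ v
        norm = np.linalg.norm(v)
        log_growth += np.log(norm)
        v /= norm
    return log_growth / steps


def make_sgd_jacobian(eta, dim):
    """Hypothetical per-sample update matrix I - eta * x x^T with Gaussian x."""
    def sample(rng):
        x = rng.normal(size=dim)
        return np.eye(dim) - eta * np.outer(x, x)
    return sample


for eta in (0.1, 1.0, 5.0):  # hypothetical learning rates
    lam = max_lyapunov(make_sgd_jacobian(eta, dim=3), dim=3)
    print(f"eta={eta}: estimated Lambda={lam:+.3f}")
```

A negative estimate indicates that the linearized iterates contract toward the stationary point, and a positive one indicates that they escape. The estimate depends on the assumed distribution of $x_t$ and on the number of steps.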

Paper Structure

This paper contains 29 sections, 7 theorems, 52 equations, and 13 figures.

Key Result

Theorem 1

Let $\theta_t$ follow the 1d symmetry dynamics equation. Then, for any distribution of $h(x)$, $n^T(\theta_t - \theta^*) \to_p 0$ if and only if

Figures (13)

  • Figure 1: Escaping from two types of saddles in a ReLU network under SGD. For escapes from Type-I saddles, the escaping process starts immediately for every trajectory. For Type-II saddles, the escape starts only well after training begins, despite the gradient noise. The black dashed line shows an exponential fit to the Type-II escape, which suggests a connection between the problem and Lyapunov exponents. See the appendix for details on the construction of these saddles.
  • Figure 2: SGD exhibits a complex phase diagram through the lens of probabilistic stability. Left: $a$ denotes the parameter in the data distribution, as discussed in detail in the section on the phases of learning. For a matrix factorization saddle point, the dynamics of SGD can be categorized into at least five different phases. Phases I, II, and IV correspond to a successful escape from the saddle. Phase III is where the model converges to a low-rank saddle point. Phase I corresponds to the case $w_t \to_p u_t$, which signals correct learning. In phase Ia, the model also converges in variance. Phase II corresponds to stable but incorrect learning, where $w_t \to_p -u_t$. Phase IV corresponds to complete instability. Right: the phases of SGD can be quantified by the sign of the Lyapunov exponent $\Lambda$. When $\Lambda<0$, SGD collapses to a saddle point; when $\Lambda>0$, SGD escapes the saddle and enters an escaping phase. The two escaping phases are qualitatively different. For a small learning rate, the model is in a learning phase because the saddle point is repulsive, and the model is likely to converge to local minima close to the saddle. For a very large learning rate, SGD escapes the saddle due to the dynamical instability of SGD, and the model will move far away from the saddle. In addition, the magnitude of the Lyapunov exponent quantifies the speed of the learning dynamics. See the appendix for numerical details of this example.
  • Figure 3: At a small learning rate, the spred algorithm (ziyin2023sparsity) converges to the ground truth solution of lasso. At a large learning rate, however, it is biased towards sparser solutions than the ground truth because the sparser solutions are Type-II saddles and thus attractive for SGD at a large learning rate.
  • Figure 4: How SGD selects a solution. Left: The landscape of a two-layer network with the swish activation function (ramachandran2017searching). The black arrow corresponds to the experimental trajectory and the prediction of probabilistic stability, while the red arrow corresponds to the (false) prediction of $L_2$ stability. Middle, Right: the generalization performance of the model for different learning rates. Middle: Initialized at solution B, SGD first jumps to C and then diverges. Right: Initialized at A, SGD also jumps to C and diverges. In both cases, the behavior of SGD agrees with the prediction of probabilistic stability rather than $L_2$ stability. Instead of jumping between local minima, SGD at a large learning rate transitions from minima to saddles.
  • Figure 5: Phase diagram at $N=100$. The colors indicate the phases as in the first phase diagram (Figure 2). At finite size, the phase boundaries have a fractal-like structure. The bottom-left of the phase diagram has a smooth boundary and is shared across almost all phase diagrams we plotted.
  • ...and 8 more figures

Theorems & Definitions (16)

  • Definition 1
  • Definition 2
  • Definition 3
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Proposition 1
  • proof
  • proof
  • ...and 6 more