Type-II Saddles and Probabilistic Stability of Stochastic Gradient Descent
Liu Ziyin, Botao Li, Tomer Galanti, Masahito Ueda
TL;DR
The paper addresses SGD dynamics near saddle points in neural networks by distinguishing Type-II saddles, which exhibit vanishing gradient noise and are hard to escape, from Type-I saddles. It introduces probabilistic stability and Lyapunov exponents to analyze attractivity around Type-II saddles, leveraging a random matrix product framework and the Furstenberg–Kesten theorem. The key result is that a negative maximal Lyapunov exponent, $\Lambda<0$, is equivalent to probabilistic stability. The theory predicts four distinct learning phases governed by the gradient signal-to-noise ratio, with implications for initialization and optimization. Empirical studies on lasso-like problems and deep networks (e.g., ResNet-18 on CIFAR-10) corroborate the theory, suggesting that saddle-point dynamics, not just minima, play a central role in solution selection and that SGD can converge to low-rank saddles under certain conditions.
Abstract
Characterizing and understanding the dynamics of stochastic gradient descent (SGD) around saddle points remains an open problem. We first show that saddle points in neural networks can be divided into two types, among which the Type-II saddles are especially difficult to escape from because the gradient noise vanishes at the saddle. To leading order, the dynamics of SGD around these saddles are therefore described by a random matrix product process, which makes it natural to study them using the notion of probabilistic stability and the related Lyapunov exponent. Theoretically, we link the study of SGD dynamics to well-known concepts in ergodic theory, which we leverage to show that saddle points can be either attractive or repulsive for SGD, and that the SGD dynamics can be classified into four different phases, depending on the signal-to-noise ratio of the gradient close to the saddle.
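To make the random-product picture concrete, here is a minimal one-dimensional sketch. It is an illustrative toy model, not an experiment from the paper. It assumes the linearized SGD update near the saddle is w_{t+1} = (1 - lr * h_t) * w_t, where h_t is a per-sample curvature drawn uniformly from a small hypothetical set. In one dimension the matrix product reduces to a product of scalars, so the Lyapunov exponent is the average of log|1 - lr * h|. The names `hessians`, `lr`, `lyapunov_exponent_1d`, and `simulate_log_rate` are assumptions chosen for this example.

```python
import math
import random

def lyapunov_exponent_1d(hessians, lr):
    """Exponent of w_{t+1} = (1 - lr*h_t) * w_t with h_t uniform over `hessians`.

    In 1D this is the mean of log|1 - lr*h| (assumes no factor is exactly zero).
    """
    return sum(math.log(abs(1.0 - lr * h)) for h in hessians) / len(hessians)

def simulate_log_rate(hessians, lr, steps, seed=0):
    """Empirical growth rate (1/steps) * log|w_steps / w_0| along one SGD run."""
    rng = random.Random(seed)
    log_w = 0.0  # tracks log|w_t| relative to w_0, avoiding underflow
    for _ in range(steps):
        h = rng.choice(hessians)  # sample one per-example curvature
        log_w += math.log(abs(1.0 - lr * h))
    return log_w / steps

# Hypothetical per-sample curvatures: the mean is negative, so the saddle
# is an escape direction for full-batch gradient descent.
hessians = [-0.6, 0.5]
lr = 1.0

lam = lyapunov_exponent_1d(hessians, lr)
mean_h = sum(hessians) / len(hessians)
gd_rate = math.log(abs(1.0 - lr * mean_h))  # growth rate under full-batch GD
empirical = simulate_log_rate(hessians, lr, steps=100_000)

print(f"Lyapunov exponent (SGD): {lam:.4f}")
print(f"Growth rate (GD):        {gd_rate:.4f}")
print(f"Empirical rate (SGD):    {empirical:.4f}")
```

In this toy setting the gradient descent rate is positive, so the averaged dynamics escape the saddle, while the Lyapunov exponent of the sampled dynamics is negative, so |w_t| shrinks toward the saddle along typical SGD trajectories. This illustrates how a saddle can be attractive for SGD even though it is repulsive on average.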
