Can Stability be Detrimental? Better Generalization through Gradient Descent Instabilities

Lawrence Wang, Stephen J. Roberts

TL;DR

This work investigates why gradient descent can generalize well even when learning rates exceed the classical stability threshold. By combining a diagonal linear network toy model with extensive full-batch experiments, it shows that instabilities cause rotations of the top Hessian eigenvectors, driving exploration toward flatter regions and producing progressive flattening of the loss landscape. The authors introduce the concept of Rotational Polarity of Eigenvectors, demonstrate a phase transition in generalization benefits tied to large learning rates, and provide empirical evidence of a Goldilocks zone where large, yet not excessive, learning rates yield the best generalization. The findings challenge the view of sharpness as the sole predictor of generalization and offer practical guidance for leveraging implicit regularization via gradient-descent instabilities.

Abstract

Traditional analyses of gradient descent optimization show that, when the largest eigenvalue of the loss Hessian, often referred to as the sharpness, is below a critical threshold set by the learning rate, training is 'stable' and the training loss decreases monotonically. Recent studies, however, have suggested that the majority of modern deep neural networks achieve good performance despite operating outside this stable regime. In this work, we demonstrate that such instabilities, induced by large learning rates, move model parameters toward flatter regions of the loss landscape. Our crucial insight lies in noting that, during these instabilities, the orientation of the Hessian eigenvectors rotates. This, we conjecture, allows the model to explore regions of the loss landscape that display more desirable geometrical properties for generalization, such as flatness. These rotations are a consequence of network depth, and we prove that for any network with depth > 1, unstable growth in parameters causes rotations in the principal components of the Hessian, which promote exploration of the parameter space away from unstable directions. Our empirical studies reveal an implicit regularization effect in gradient descent with large learning rates operating beyond the stability threshold. We find that these instabilities lead to excellent generalization performance on modern benchmark datasets.
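The stability threshold mentioned in the abstract can be illustrated with a minimal sketch on a quadratic loss, where the sharpness is constant and the classical limit is a learning rate of 2 divided by the sharpness. The function name `gd_losses`, the eigenvalues, and the starting point are illustrative assumptions, not values taken from the paper's experiments.

```python
import numpy as np

def gd_losses(eta, eigenvalues, theta0, steps=50):
    """Run gradient descent on L(theta) = 0.5 * sum(lam_i * theta_i**2)
    and return the loss recorded before each update."""
    lam = np.asarray(eigenvalues, dtype=float)
    theta = np.asarray(theta0, dtype=float).copy()
    losses = []
    for _ in range(steps):
        losses.append(0.5 * np.sum(lam * theta**2))
        theta = theta - eta * lam * theta  # the gradient is lam * theta
    return np.array(losses)

# Illustrative (assumed) Hessian eigenvalues; the sharpness is the largest one.
eigenvalues = [200.0, 1.0]
threshold = 2.0 / max(eigenvalues)  # classical stability limit: 2 / sharpness
theta0 = [0.1, 1.0]

stable = gd_losses(0.9 * threshold, eigenvalues, theta0)    # just below the limit
unstable = gd_losses(1.1 * threshold, eigenvalues, theta0)  # just above the limit

print("monotone decrease below the limit:", bool(np.all(np.diff(stable) < 0)))
print("loss grows above the limit:", bool(unstable[-1] > unstable[0]))
```

Because the Hessian of a quadratic is constant, its eigenvectors cannot rotate and the unstable run simply diverges along the sharpest direction. This is consistent with the abstract's statement that the rotations arise from network depth.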

Paper Structure

This paper contains 29 sections, 1 theorem, 18 equations, 13 figures.

Key Result

Lemma 1

For a convex, $l$-smooth function $L(\theta)$, the gradient descent iterates with learning rate $\eta$ satisfy $L(\theta_{t+1}) \leq L(\theta_t) - \eta\left(1-\frac{\eta l}{2}\right)\|\nabla L(\theta_t)\|_2^2$
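The bound can be recovered in a few lines, which also shows where the stability threshold comes from. The sketch below assumes the plain gradient descent update $\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$ and uses the standard quadratic upper bound implied by $l$-smoothness; it is a textbook derivation and not necessarily the proof given in the paper.

```latex
\begin{align*}
L(\theta_{t+1})
  &\leq L(\theta_t) + \nabla L(\theta_t)^\top (\theta_{t+1} - \theta_t)
        + \frac{l}{2}\,\|\theta_{t+1} - \theta_t\|_2^2 \\
  &= L(\theta_t) - \eta\,\|\nabla L(\theta_t)\|_2^2
        + \frac{\eta^2 l}{2}\,\|\nabla L(\theta_t)\|_2^2 \\
  &= L(\theta_t) - \eta\left(1 - \frac{\eta l}{2}\right)\|\nabla L(\theta_t)\|_2^2 .
\end{align*}
```

The coefficient $\eta\left(1 - \frac{\eta l}{2}\right)$ is positive exactly when $0 < \eta < 2/l$, so a decrease in the loss is guaranteed only below that learning rate.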

Figures (13)

  • Figure 1: Optimization trajectories in a $2$-parameter DLN display rotations. We visualize $4$ trajectories, initialized at $(-0.1, 10)$ with $z = \Theta^2$ and choosing $\eta \in \{0.001, 0.0095, 0.011, 0.013\}$, where the stability limit (at initialization) is $\eta_\mathrm{eos} = 0.01$. Left: the regimes of $\gamma_\beta$. Middle: the loss landscape. Right: map of eigenvector orientations (y-axis magnitudes amplified for clarity).
  • Figure 2: During instabilities, the sharpest eigenvectors of the Hessian rotate away smoothly and monotonically (Top), while stable training reverts these rotations (Bottom). We track the similarity of the sharpest Hessian eigenvectors across epochs through three instabilities. Left: $L(\theta)$ and $S(\theta)$. Top: similarities of the $k$-th eigenvectors (colored) and of subspaces formed by the top $3$ eigenvectors (black) during instabilities. Bottom: similarity of subspaces formed by the top $3$ eigenvectors to the baseline (black) across various timings (colored) when $\eta$ reduction begins.
  • Figure 3: Parameter growth along the sharpest Hessian eigenvectors leads to exploration of the peripheries of the local minima, driving up $L(\theta)$ and $S(\theta)$ in the process. As the instability develops, the $S(\theta)$ curve undergoes large changes until a flat region is found to enable a return to stability. We show snapshots along the instability cycle taken along the direction of the gradient. The dotted/solid vertical lines indicate the positions of previous/current parameters, respectively.
  • Figure 4: Progressive flattening in fMNIST. We plot the eventual maximum $S(\theta)_\mathrm{max}$ of MLPs trained with a constant large initial learning rate $\eta_0$ before it is reduced to $\eta_\mathrm{small}=0.01$ at the indicated epochs. The larger $\eta_0$ is and the longer the phase at $\eta_0$ lasts, the greater the reduction in $S(\theta)_\mathrm{max}$.
  • Figure 5: Generalization performance improves past the stability limit. We train models until completion and plot validation accuracy against the learning rate $\eta_0$. The X/O markers differentiate $\eta_0$s below/above the stability limit (dotted line), and the color spectrum from dark purple to light yellow marks the different learning rates from low to high.
  • ...and 8 more figures
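The captions above refer to similarities between Hessian eigenvectors and between subspaces spanned by the top eigenvectors, but this summary does not state the exact measure. One plausible choice, sketched below as an assumption, is the absolute cosine similarity for single eigenvectors and the normalized squared Frobenius norm of $U^\top V$ for subspaces. The helper names and the two-parameter product loss $L(a, b) = \frac{1}{2}(ab - 1)^2$ are hypothetical illustrations, not the paper's DLN setup.

```python
import numpy as np

def top_eigvecs(hessian, k):
    """Return the k eigenvectors with the largest eigenvalues, as columns."""
    _, vecs = np.linalg.eigh(hessian)  # eigh sorts eigenvalues in ascending order
    return vecs[:, ::-1][:, :k]

def eigvec_similarity(u, v):
    """Absolute cosine similarity of unit vectors (eigenvector sign is arbitrary)."""
    return abs(float(u @ v))

def subspace_similarity(U, V):
    """||U^T V||_F^2 / k for orthonormal bases U, V with k columns; lies in [0, 1]."""
    k = U.shape[1]
    return float(np.linalg.norm(U.T @ V, "fro") ** 2 / k)

def product_hessian(a, b, y=1.0):
    """Hessian of the assumed toy loss L(a, b) = 0.5 * (a * b - y)**2."""
    off_diag = 2.0 * a * b - y
    return np.array([[b * b, off_diag], [off_diag, a * a]])

# Two parameter settings with the same product a * b but different balance.
H1 = product_hessian(0.5, 2.0)
H2 = product_hessian(2.0, 0.5)

u1 = top_eigvecs(H1, 1)[:, 0]
u2 = top_eigvecs(H2, 1)[:, 0]

sim_same = eigvec_similarity(u1, u1)
sim_rotated = eigvec_similarity(u1, u2)
sim_full = subspace_similarity(top_eigvecs(H1, 2), top_eigvecs(H2, 2))

print("same point:", sim_same)
print("different point:", sim_rotated)
print("full 2-D subspaces:", sim_full)
```

In this toy model the sharpest eigenvector depends on where the parameters sit, so the similarity between the two points falls below 1 even though both fit the target equally well. A model with a constant Hessian would give a similarity of 1 everywhere.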

Theorems & Definitions (1)

  • Lemma