Can Stability be Detrimental? Better Generalization through Gradient Descent Instabilities
Lawrence Wang, Stephen J. Roberts
TL;DR
This work investigates why gradient descent can generalize well even when learning rates exceed the classical stability threshold. By combining a diagonal linear network toy model with extensive full-batch experiments, it shows that instabilities cause rotations of the top Hessian eigenvectors, driving exploration toward flatter regions and producing progressive flattening of the loss landscape. The authors introduce the concept of Rotational Polarity of Eigenvectors, demonstrate a phase transition in generalization benefits tied to large learning rates, and provide empirical evidence of a Goldilocks zone where large, yet not excessive, learning rates yield the best generalization. The findings challenge the primacy of sharpness as the sole predictor of generalization and offer practical guidance for leveraging implicit regularization via gradient descent instabilities.
Abstract
Traditional analyses of gradient descent optimization show that, when the largest eigenvalue of the loss Hessian, often referred to as the sharpness, is below a critical threshold set by the learning rate, training is 'stable' and the training loss decreases monotonically. Recent studies, however, have suggested that the majority of modern deep neural networks achieve good performance despite operating outside this stable regime. In this work, we demonstrate that such instabilities, induced by large learning rates, move model parameters toward flatter regions of the loss landscape. Our crucial insight is that, during these instabilities, the Hessian eigenvectors rotate. This, we conjecture, allows the model to explore regions of the loss landscape that display more desirable geometrical properties for generalization, such as flatness. These rotations are a consequence of network depth, and we prove that for any network with depth > 1, unstable growth in parameters causes rotations in the principal components of the Hessian, which promote exploration of the parameter space away from unstable directions. Our empirical studies reveal an implicit regularization effect in gradient descent with large learning rates operating beyond the stability threshold. We find that this effect leads to excellent generalization performance on modern benchmark datasets.
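As a rough illustration of the two ideas in the abstract, the sketch below is not the authors' code and uses hypothetical function names. It assumes two toy problems chosen purely for illustration: a one-dimensional quadratic, where the classical stability condition reduces to sharpness < 2 / learning rate, and a two-parameter product model L(a, b) = 0.5 (ab - 1)^2 as the simplest depth-2 case, where the top Hessian eigenvector depends on the parameters and so can rotate as they change.

```python
import numpy as np


def gd_on_quadratic(sharpness, lr, steps=20, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2.

    Each step multiplies x by (1 - lr * sharpness), so the iterates shrink
    only when |1 - lr * sharpness| < 1, i.e. when sharpness < 2 / lr.
    Returns the list of loss values before each step.
    """
    x = x0
    losses = []
    for _ in range(steps):
        losses.append(0.5 * sharpness * x * x)
        x = x - lr * sharpness * x
    return losses


def top_hessian_eigvec_depth2(a, b):
    """Top Hessian eigenvector of the toy loss L(a, b) = 0.5 * (a*b - 1)**2.

    Second derivatives: L_aa = b**2, L_bb = a**2, L_ab = 2*a*b - 1.
    """
    hessian = np.array([[b * b, 2.0 * a * b - 1.0],
                        [2.0 * a * b - 1.0, a * a]])
    _, eigvecs = np.linalg.eigh(hessian)  # eigenvalues in ascending order
    v = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue
    return v if v[0] >= 0 else -v         # fix the arbitrary sign


if __name__ == "__main__":
    lr = 0.1  # stability threshold for the quadratic is 2 / lr = 20
    print("final loss, sharpness 15:", gd_on_quadratic(15.0, lr)[-1])
    print("final loss, sharpness 25:", gd_on_quadratic(25.0, lr)[-1])

    # Two points on the same solution manifold a * b = 1.
    v_balanced = top_hessian_eigvec_depth2(1.0, 1.0)
    v_unbalanced = top_hessian_eigvec_depth2(2.0, 0.5)
    angle = np.degrees(np.arccos(np.clip(v_balanced @ v_unbalanced, -1.0, 1.0)))
    print("rotation of top eigenvector (degrees):", angle)
```

In a depth-1 linear model with squared loss the Hessian does not depend on the parameters, so its eigenvectors cannot rotate during training. The product model is about the smallest setting in which they can, which is why it is used here as a stand-in for the deeper networks the abstract refers to.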
