Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning
Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin
TL;DR
The paper asks why spikes in the training loss occur during SGD and how they relate to generalization. It provides evidence that the spikes are catapults, a dynamic first observed in GD with large learning rates, and that they occur in the low-dimensional subspace spanned by the top eigenvectors of the Neural Tangent Kernel, for both GD and SGD. Catapults promote feature learning by increasing alignment with the Average Gradient Outer Product (AGOP) of the true predictor, and smaller batch sizes induce more catapults, which improves AGOP alignment and test performance. Experiments across architectures and datasets, including CelebA and CIFAR/SVHN, support this mechanism. The findings link optimization dynamics to generalization: intentionally inducing catapults, through larger learning rates or smaller batches, can improve test performance by guiding the model toward AGOP-aligned representations, and AGOP alignment serves as a robust predictor of generalization across optimizers.
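To make the catapult dynamic concrete, here is a minimal sketch on a toy problem. The two-parameter model f = u * v, the single training example (input 1, target 0), and the initial values are assumptions chosen for illustration and are not taken from the paper. For this model the tangent kernel is the scalar u**2 + v**2, and the learning rate is set above 2 divided by its initial value, so the loss first spikes and then falls while the kernel shrinks.

```python
# Toy model (an assumption for illustration): f = u * v, one training example
# with input 1 and target 0, loss L = f**2 / 2. The tangent kernel is the
# scalar lam = u**2 + v**2, so the critical learning rate is 2 / lam.

def run_gd(u, v, lr, steps):
    """Run full-batch gradient descent and record the loss and kernel per step."""
    losses, kernels = [], []
    for _ in range(steps):
        f = u * v
        losses.append(0.5 * f * f)
        kernels.append(u * u + v * v)
        # Simultaneous gradient step: dL/du = f * v, dL/dv = f * u.
        u, v = u - lr * f * v, v - lr * f * u
    return losses, kernels

# lr * kernels[0] lies between 2 and 4, the regime in which a catapult is expected.
losses, kernels = run_gd(u=2.0, v=0.1, lr=0.7, steps=50)

print("initial loss:", losses[0])
print("peak loss:   ", max(losses))
print("final loss:  ", losses[-1])
print("kernel before and after:", kernels[0], kernels[-1])
```

The loss rises by more than an order of magnitude before converging, and the kernel ends below 2 / lr, the value at which plain gradient descent becomes stable for this learning rate.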
Abstract
In this paper, we first present an explanation regarding the common occurrence of spikes in the training loss when neural networks are trained with stochastic gradient descent (SGD). We provide evidence that the spikes in the training loss of SGD are "catapults", an optimization phenomenon originally observed in GD with large learning rates in [Lewkowycz et al. 2020]. We empirically show that these catapults occur in a low-dimensional subspace spanned by the top eigenvectors of the tangent kernel, for both GD and SGD. Second, we posit an explanation for how catapults lead to better generalization by demonstrating that catapults promote feature learning by increasing alignment with the Average Gradient Outer Product (AGOP) of the true predictor. Furthermore, we demonstrate that a smaller batch size in SGD induces a larger number of catapults, thereby improving AGOP alignment and test performance.
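As a rough illustration of AGOP alignment, the sketch below computes the AGOP of a scalar predictor as the average over inputs of the outer product of its input gradient with itself, and measures alignment as the cosine similarity between two such matrices. The target function x[0] * x[1], the two stand-in "models", the Gaussian inputs, and the finite-difference gradients are all assumptions made for this example; in practice the gradients would come from automatic differentiation of a trained network.

```python
import math
import random

def grad_fd(f, x, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at point x."""
    grad = []
    for j in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        grad.append((f(x_plus) - f(x_minus)) / (2 * eps))
    return grad

def agop(f, xs):
    """Average over the inputs xs of the outer product grad f(x) grad f(x)^T."""
    d = len(xs[0])
    G = [[0.0] * d for _ in range(d)]
    for x in xs:
        g = grad_fd(f, x)
        for a in range(d):
            for b in range(d):
                G[a][b] += g[a] * g[b] / len(xs)
    return G

def alignment(A, B):
    """Cosine similarity between two matrices under the Frobenius inner product."""
    inner = sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))
    norm_a = math.sqrt(sum(a * a for row in A for a in row))
    norm_b = math.sqrt(sum(b * b for row in B for b in row))
    return inner / (norm_a * norm_b)

rng = random.Random(0)
xs = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(200)]

# Hypothetical target and two hypothetical predictors.
f_true = lambda x: x[0] * x[1]            # depends only on coordinates 0 and 1
f_aligned = lambda x: 2.0 * x[0] * x[1]   # same features, different scale
f_misaligned = lambda x: x[2] * x[3]      # relies on the wrong coordinates

G_true = agop(f_true, xs)
align_good = alignment(agop(f_aligned, xs), G_true)
align_bad = alignment(agop(f_misaligned, xs), G_true)
print("aligned predictor:   ", align_good)
print("misaligned predictor:", align_bad)
```

Because cosine similarity ignores scale, a predictor that uses the same input directions as the target scores close to 1, while one that relies on unrelated coordinates scores close to 0.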
