Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning

Libin Zhu; Chaoyue Liu; Adityanarayanan Radhakrishnan; Mikhail Belkin


TL;DR

The paper asks why spikes appear in the training loss under SGD and how they relate to generalization. It provides evidence that the spikes are catapults, a phenomenon first observed in GD with large learning rates, and shows empirically that, for both GD and SGD, they occur in the low-dimensional subspace spanned by the top eigenvectors of the Neural Tangent Kernel (NTK). Catapults promote feature learning by increasing alignment with the Average Gradient Outer Product (AGOP) of the true predictor, and smaller batch sizes induce more catapults, which improves both AGOP alignment and test performance. Experiments across architectures and datasets, including CelebA, CIFAR, and SVHN, support this mechanism. The findings link optimization dynamics to generalization: they suggest that deliberately inducing catapults, through larger learning rates or smaller batches, can improve test performance, with AGOP alignment serving as a predictor of generalization across optimizers.

Abstract

In this paper, we first present an explanation regarding the common occurrence of spikes in the training loss when neural networks are trained with stochastic gradient descent (SGD). We provide evidence that the spikes in the training loss of SGD are "catapults", an optimization phenomenon originally observed in GD with large learning rates in [Lewkowycz et al. 2020]. We empirically show that these catapults occur in a low-dimensional subspace spanned by the top eigenvectors of the tangent kernel, for both GD and SGD. Second, we posit an explanation for how catapults lead to better generalization by demonstrating that catapults promote feature learning by increasing alignment with the Average Gradient Outer Product (AGOP) of the true predictor. Furthermore, we demonstrate that a smaller batch size in SGD induces a larger number of catapults, thereby improving AGOP alignment and test performance.
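As a rough illustration of the AGOP alignment mentioned in the abstract, the sketch below computes the average gradient outer product of a scalar function over a set of inputs and compares two AGOPs by cosine similarity under the Frobenius inner product. The toy functions `f_true`, `f_good`, and `f_bad`, the sample points, and the finite-difference gradients are assumptions made for illustration; they are not taken from the paper, and the paper's exact normalization may differ.

```python
import math

def grad_fd(f, x, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at x."""
    g = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += eps
        xm[j] -= eps
        g.append((f(xp) - f(xm)) / (2 * eps))
    return g

def agop(f, xs):
    """Average over xs of the outer product grad f(x) grad f(x)^T."""
    d = len(xs[0])
    G = [[0.0] * d for _ in range(d)]
    for x in xs:
        g = grad_fd(f, x)
        for a in range(d):
            for b in range(d):
                G[a][b] += g[a] * g[b] / len(xs)
    return G

def alignment(A, B):
    """Cosine similarity of two matrices under the Frobenius inner product."""
    dot = sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))
    na = math.sqrt(sum(a * a for r in A for a in r))
    nb = math.sqrt(sum(b * b for r in B for b in r))
    return dot / (na * nb)

# Hypothetical target that depends only on the first input coordinate.
def f_true(x):
    return x[0] ** 2

# Hypothetical model that mostly uses the same coordinate as the target.
def f_good(x):
    return 0.9 * x[0] ** 2 + 0.01 * x[1]

# Hypothetical model that relies on the wrong coordinate.
def f_bad(x):
    return x[1] ** 2

xs = [[0.5, -1.0], [1.0, 0.3], [-0.7, 0.8], [0.2, -0.4]]
G_true = agop(f_true, xs)
align_good = alignment(agop(f_good, xs), G_true)
align_bad = alignment(agop(f_bad, xs), G_true)
print(align_good, align_bad)
```

In this toy setting the model that uses the same input direction as the target has alignment close to one, while the model that uses the orthogonal direction has alignment close to zero.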


Paper Structure (78 sections, 1 theorem, 13 equations, 34 figures, 3 tables)


Introduction
Optimization.
Generalization.
Related works
Linear dynamics and catapult phase phenomenon.
Edge of stability.
Generalization and sharpness.
Preliminaries
Notation.
Optimization task.
Neural Tangent Kernel (NTK).
Top-eigenspace and decomposition of the loss.
Critical learning rate.
Catapult dynamics.
Catapults in optimization
...and 63 more sections

Key Result

Lemma 1

For a smooth loss ${\mathcal{L}}({\mathbf{w}}):\mathbb{R}^p\rightarrow \mathbb{R}$, suppose $\lambda_{\max}(H_{\mathcal{L}}({\mathbf{w}})) \leq \beta$ for all ${\mathbf{w}}\in \mathbb{R}^p$, then GD satisfies:
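The conclusion of the lemma is not reproduced above. For orientation, the standard textbook form of the descent lemma is sketched here, assuming the GD update ${\mathbf{w}}^{t+1} = {\mathbf{w}}^{t} - \eta \nabla {\mathcal{L}}({\mathbf{w}}^{t})$ with learning rate $\eta$; the symbols $\eta$ and ${\mathbf{w}}^{t}$ are assumptions here, and the paper's exact statement may differ in notation or constants:

$$
{\mathcal{L}}({\mathbf{w}}^{t+1}) \leq {\mathcal{L}}({\mathbf{w}}^{t}) - \eta\left(1 - \frac{\eta\beta}{2}\right)\left\|\nabla {\mathcal{L}}({\mathbf{w}}^{t})\right\|^{2}
$$

Under this form the loss is guaranteed not to increase whenever $\eta < 2/\beta$, which is the threshold that motivates the notion of a critical learning rate.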

Figures (34)

Figure 1: Spikes in training loss when optimized using SGD (x-axis: iteration). (Source: https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
Figure 2: An illustration of the catapult. This experiment corresponds to Fig. \ref{fig:cata_fcn}a.
Figure 3: Catapult occurring in the top eigenspace of NTK in GD for 5-layer FCN (a) and CNN (b). The training loss is decomposed into the eigenspace of NTK, i.e., ${\mathcal{L}} = {\mathcal{L}}_{\leq 5} + {\mathcal{L}}_{>5}$. In the experiment, both networks are trained by GD on $128$ data points from CIFAR-10 with learning rate $6$ and $8$ respectively (the critical learning rates are $3.6$ for FCN and $4.5$ for CNN).
Figure 4: Multiple catapults during GD with increased learning rates. We train a 5-layer FCN and a CNN on a subset of CIFAR-10 using GD. The learning rate is increased twice in each experiment. The experimental details can be found in Appendix \ref{exp:muli_cata}.
Figure 5: Exact match between the occasion when $\eta>{\eta_{\mathrm{crit}}}(X_{\mathrm{batch}})$ and loss spike for SGD. We train a two-layer neural network on a synthetic dataset using SGD with batch size one.
...and 29 more figures
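The condition $\eta>{\eta_{\mathrm{crit}}}(X_{\mathrm{batch}})$ in the Figure 5 caption can be illustrated on a much simpler model than the two-layer network used there. The sketch below swaps in a linear model $f({\mathbf{w}};x) = {\mathbf{w}}^\top x$ with per-sample loss $\frac{1}{2}(f-y)^2$ and batch size one, so that the tangent kernel of the batch is the scalar $\|x\|^2$ and the critical learning rate is $2/\|x\|^2$. The model, the data, and the helper names are assumptions chosen for illustration; for a nonlinear network the same relation holds only locally.

```python
def predict(w, x):
    """Linear model f(w; x) = w . x (a stand-in for the network)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def loss(w, x, y):
    """Single-sample squared loss 0.5 * (f - y)^2."""
    return 0.5 * (predict(w, x) - y) ** 2

def sgd_step(w, x, y, lr):
    """One SGD step with batch size one on the loss above."""
    r = predict(w, x) - y
    return [wi - lr * r * xi for wi, xi in zip(w, x)]

def eta_crit(x):
    """2 / lambda_max(K), where K = ||grad_w f||^2 = ||x||^2 is 1x1 here."""
    return 2.0 / sum(xi * xi for xi in x)

# Hypothetical weights and a single hypothetical training sample.
w, x, y = [0.5, -0.2], [1.0, 2.0], 1.0

crit = eta_crit(x)
start = loss(w, x, y)
below = loss(sgd_step(w, x, y, 0.9 * crit), x, y)  # lr under the threshold
above = loss(sgd_step(w, x, y, 1.1 * crit), x, y)  # lr over the threshold
print(crit, start, below, above)
```

For this linear model the residual after one step is scaled by $(1 - \eta\|x\|^2)$, so the loss on the sampled point decreases when $\eta < {\eta_{\mathrm{crit}}}$ and increases when $\eta > {\eta_{\mathrm{crit}}}$, which is the single-step analogue of a loss spike.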

Theorems & Definitions (8)

Definition 1: (Neural) Tangent Kernel
Lemma 1: Descent Lemma [Nesterov 1983]
Claim 1
Remark 1
Remark 2: Top eigenspace accounts for the sharp loss spikes in SGD
Remark 3: Catapults in SGD with cyclical learning rate schedule
Remark 4
Remark 5
