Loss Spike in Training Neural Networks

Xiaolong Li; Zhi-Qin John Xu; Zhongwang Zhang

TL;DR

This work investigates loss spikes observed during neural network training by introducing the lower-loss-as-sharper (LLAS) structure to explain the ascent phase and a frequency-based mechanism to explain rapid descent. It reframes the flatness-generalization relationship through a frequency perspective, showing that the maximum Hessian eigenvalue $\lambda_{\max}$ captures linear stability but does not fully account for generalization, which depends on multi-directional effects and low-frequency content. The study also uncovers a link between loss spikes and feature condensation, with spikes associated with more condensed weight configurations and a measurable correlation between $\lambda_{\max}$ and condensation. Together, these results offer a nuanced view of training dynamics, highlighting that sharpness metrics alone are insufficient to predict generalization and that spike-induced condensation may contribute to better generalization under certain conditions.
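The statement that $\lambda_{\max}$ captures linear stability can be illustrated on a one-dimensional quadratic loss. The sketch below is not taken from the paper: the function name `gd_quadratic_losses` and the chosen values of the curvature `lam` and learning rate `lr` are assumptions made for illustration. It relies only on the standard fact that gradient descent on $L(\theta) = \tfrac{1}{2}\lambda\theta^2$ multiplies $\theta$ by $(1 - \eta\lambda)$ at each step, so the loss grows exponentially once $\eta\lambda > 2$.

```python
import numpy as np

def gd_quadratic_losses(lam, lr, theta0=1.0, steps=20):
    """Run gradient descent on L(theta) = 0.5 * lam * theta**2.

    Hypothetical helper for illustration; returns the loss at each step.
    """
    theta = theta0
    losses = []
    for _ in range(steps):
        losses.append(0.5 * lam * theta ** 2)
        # The gradient is lam * theta, so theta is scaled by (1 - lr * lam).
        theta = theta - lr * lam * theta
    return np.array(losses)

# Assumed values: curvature 10, so the stability threshold is lr = 2 / 10 = 0.2.
stable = gd_quadratic_losses(lam=10.0, lr=0.15)    # lr * lam = 1.5 < 2
unstable = gd_quadratic_losses(lam=10.0, lr=0.25)  # lr * lam = 2.5 > 2

print("stable run, last/first loss:  ", stable[-1] / stable[0])
print("unstable run, last/first loss:", unstable[-1] / unstable[0])
```

In the unstable run the loss is multiplied by $(1 - \eta\lambda)^2$ at every step, which is the kind of exponential growth associated with the ascent phase of a spike. A real network differs in that the curvature changes along the trajectory, which this toy model does not capture.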

Abstract

In this work, we investigate the mechanism underlying loss spikes observed during neural network training. When training enters a region with a lower-loss-as-sharper (LLAS) structure, it becomes unstable, and the loss increases exponentially once the loss landscape is too sharp, resulting in the rapid ascent of the loss spike. Training stabilizes when it finds a flat region. From a frequency perspective, we explain the rapid descent in loss as being primarily influenced by low-frequency components. We observe a deviation in the first eigendirection, which can be reasonably explained by the frequency principle: low-frequency information is captured rapidly, leading to the rapid descent. Inspired by our analysis of loss spikes, we revisit the link between the maximum eigenvalue of the loss Hessian ($\lambda_{\max}$), flatness, and generalization. We suggest that $\lambda_{\max}$ is a good measure of sharpness but not a good measure of generalization. Furthermore, we experimentally observe that loss spikes can facilitate condensation, causing input weights to evolve towards the same direction. Our experiments also show a correlation (similar trend) between $\lambda_{\max}$ and condensation. This observation may provide valuable insights for further theoretical research on the relationship between loss spikes, $\lambda_{\max}$, and generalization.
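One simple way to quantify "input weights evolving towards the same direction" is the average pairwise cosine similarity between the input weight vectors of hidden neurons. The sketch below is an assumption made for illustration, not necessarily the condensation measure used in the paper; the function name `mean_pairwise_cosine` and the example matrices are hypothetical.

```python
import numpy as np

def mean_pairwise_cosine(W):
    """Mean cosine similarity over all pairs of distinct rows of W.

    W has one row per hidden neuron (its input weight vector) and needs
    at least two rows. Values near 1 mean the rows share one direction.
    This is an assumed, illustrative measure of condensation.
    """
    W = np.asarray(W, dtype=float)
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    U = W / np.clip(norms, 1e-12, None)   # unit-length rows; guards against zero rows
    C = U @ U.T                           # pairwise cosine similarities
    off_diagonal = C[~np.eye(C.shape[0], dtype=bool)]
    return float(off_diagonal.mean())

# Hypothetical weights: three neurons on one direction vs. two orthogonal neurons.
condensed = np.array([[1.0, 2.0], [2.0, 4.0], [0.5, 1.0]])
spread = np.array([[1.0, 0.0], [0.0, 1.0]])

print("condensed:", mean_pairwise_cosine(condensed))
print("spread:   ", mean_pairwise_cosine(spread))
```

Tracking such a quantity alongside $\lambda_{\max}$ over training epochs is one way to look for the similar trend described above. The signed cosine is used here to match the phrase "the same direction"; an absolute-value variant would also count opposite directions as aligned.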

Paper Structure (21 sections, 22 equations, 15 figures, 3 tables)

Introduction
Related works
Loss spike
Preliminary: Linear stability in training quadratic model
Typical loss spike experiments
Lower-loss-as-sharper (LLAS) structure
Frequency perspective for understanding descent stage
Revisit the flatness-generalization picture
Frequency perspective
Difference on each eigendirection
Implications
Loss spike, $\lambda_{\text{max}}$ and condensation
Loss spike experimentally facilitates condensation
The correlation between $\lambda_{\text{max}}$ and condensation
Conclusion and discussion
...and 6 more sections

Figures (15)

Figure 1: Schematic illustration of an ideal explanation for why flat solutions generalize well (Keskar et al., 2016).
Figure 2: (a, d, g, h) The loss value (black) and $\lambda_{\rm max}$ (red) vs. training epoch. (b, e) The loss value and $\lambda_{\rm max}$ over a specific epoch interval, marked green in (a, d), respectively. (c, f, i) The loss surface and the trajectory of the model parameters along the first two PCA directions. (a, b, c) Two-layer tanh NN with width 20. The sum of the explained variance ratios of the first two PCA directions is 0.9895. (d, e, f) Two-layer ReLU CNN with max pooling. The sum of the explained variance ratios of the first two PCA directions is 0.9882. (g, h, i) VGG-11 (Simonyan and Zisserman, 2014) with different learning rates. The sum of the explained variance ratios of the first two PCA directions is 0.9999.
Figure 3: (a) The loss surface and the trajectory of the model parameters along the first two PCA directions in the EoS stage. (b) Schematic illustration of LLAS structure in 3D. (c) The loss value and the maximum eigenvalue of the Hessian matrix of a loss spike process of the toy model. (d) The loss surface and the GD trajectory of the two-dimensional parameters of the toy model.
Figure 4: (a) Low-frequency proportion for different low-frequency thresholds. The NN used is a two-layer tanh NN with width 20. For the random output difference, we calculate the mean value and the error bar over 100 random samples. (b) Training loss and low-frequency proportion at low-frequency threshold = 2 for different epochs.
Figure 5: Two-layer tanh FNN with a width of 500. (a) The variation of the test loss with the eigenvalue index $i$ when eliminating the difference between $\bm{\theta}_{\rm train}$ and $\bm{\theta}_{\rm test}$ in the first $i$ eigendirections. (b) The output difference before and after moving $\bm{\theta}_{\rm train}$ in the first nine eigendirections of its Hessian matrix. Each subset corresponds to the case of one eigendirection.
...and 10 more figures

Theorems & Definitions (4)

Definition 1
Definition 2
Definition 3
Remark 1
