Loss Spike in Training Neural Networks
Xiaolong Li, Zhi-Qin John Xu, Zhongwang Zhang
TL;DR
This work investigates loss spikes observed during neural network training by introducing the lower-loss-as-sharper (LLAS) structure to explain the ascent phase and a frequency-based mechanism to explain rapid descent. It reframes the flatness-generalization relationship through a frequency perspective, showing that the maximum Hessian eigenvalue $\lambda_{\max}$ captures linear stability but does not fully account for generalization, which depends on multi-directional effects and low-frequency content. The study also uncovers a link between loss spikes and feature condensation, with spikes associated with more condensed weight configurations and a measurable correlation between $\lambda_{\max}$ and condensation. Together, these results offer a nuanced view of training dynamics, highlighting that sharpness metrics alone are insufficient to predict generalization and that spike-induced condensation may contribute to better generalization under certain conditions.
Abstract
In this work, we investigate the mechanism underlying loss spikes observed during neural network training. When training enters a region with a lower-loss-as-sharper (LLAS) structure, it becomes unstable, and the loss increases exponentially once the loss landscape becomes too sharp, producing the rapid ascent of the loss spike. Training stabilizes once it reaches a flat region. From a frequency perspective, we explain the rapid descent in loss as being primarily driven by low-frequency components. We observe a deviation in the first eigendirection, which can be reasonably explained by the frequency principle: low-frequency information is captured rapidly, leading to the rapid descent. Inspired by our analysis of loss spikes, we revisit the link between the maximum eigenvalue of the loss Hessian ($\lambda_{\max}$), flatness, and generalization. We suggest that $\lambda_{\max}$ is a good measure of sharpness but not a good measure of generalization. Furthermore, we experimentally observe that loss spikes can facilitate condensation, causing input weights to evolve towards the same direction. Our experiments also show a correlation (a similar trend) between $\lambda_{\max}$ and condensation. This observation may provide valuable insights for further theoretical research on the relationship between loss spikes, $\lambda_{\max}$, and generalization.
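The exponential ascent described above can be illustrated with a toy model that is not taken from the paper: plain gradient descent on a one-dimensional quadratic loss $L(x) = \tfrac{1}{2}\lambda x^2$, where $\lambda$ plays the role of $\lambda_{\max}$. Under this assumption, each step multiplies $x$ by $(1 - \eta\lambda)$, so the loss grows geometrically once $\lambda > 2/\eta$ and shrinks otherwise. The function name `gd_losses` and all numerical values are hypothetical choices made for illustration only.

```python
def gd_losses(sharpness, lr, steps=20, x0=1.0):
    """Run gradient descent on L(x) = 0.5 * sharpness * x**2.

    Returns the loss recorded before each update. This is a toy
    quadratic model (an assumption), not the paper's experimental setup.
    """
    x = x0
    losses = []
    for _ in range(steps):
        losses.append(0.5 * sharpness * x * x)
        x -= lr * sharpness * x  # gradient of L with respect to x is sharpness * x
    return losses


lr = 0.1  # with this learning rate the stability threshold is 2 / lr = 20

# Sharpness below the threshold: the loss contracts every step.
flat = gd_losses(sharpness=15.0, lr=lr)

# Sharpness above the threshold: the loss grows geometrically,
# which mimics the ascent phase of a spike in this toy setting.
sharp = gd_losses(sharpness=25.0, lr=lr)

print("flat region : first loss", flat[0], "last loss", flat[-1])
print("sharp region: first loss", sharp[0], "last loss", sharp[-1])
```

In this sketch the per-step loss ratio is $(1 - \eta\lambda)^2$, so the two runs differ only in which side of $2/\eta$ the sharpness falls on. A real network is not quadratic, so this captures only the local linear-stability picture, not the LLAS structure or the subsequent descent.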
