A Unified Noise-Curvature View of Loss of Trainability

Gunbir Singh Baveja, Alex Lewandowski, Mark Schmidt

TL;DR

This work reframes loss of trainability (LoT) in continual learning as the result of two interacting signals: gradient noise and curvature volatility. By deriving a batch-size-aware gradient-noise bound and a curvature-volatility bound, the authors define a per-layer adaptive safe step-size and implement a simple per-layer learning-rate scheduler that keeps updates within this bound. Across non-stationary task sequences, the scheduler improves the accuracy maintained by existing methods (CReLU, Wasserstein regularization, L2 weight decay) and yields adaptive step-size trajectories that, without tuning, resemble hand-tuned decay schedules. The unified view connects to prior explanations of LoT, and the experiments and per-layer analyses highlight the importance of layer-wise dynamics in avoiding it.

Abstract

Loss of trainability refers to a phenomenon in continual learning where parameter updates no longer make progress on the optimization objective, so accuracy stalls or degrades as the learning problem changes over time. In this paper, we analyze loss of trainability through an optimization lens and find that the phenomenon is not reliably predicted by existing individual indicators such as Hessian rank, sharpness level, weight or gradient norms, gradient-to-parameter ratios, and unit-sign entropy. Motivated by our analysis, we introduce two complementary indicators: a batch-size-aware gradient-noise bound and a curvature volatility-controlled bound. We then combine these two indicators into a per-layer adaptive noise threshold on the effective step-size that anticipates trainability behavior. Using this insight, we propose a step-size scheduler that keeps each layer's effective parameter update below this bound, thereby avoiding loss of trainability. We demonstrate that our scheduler can improve the accuracy maintained by previously proposed approaches, such as concatenated ReLU (CReLU), Wasserstein regularizer, and L2 weight decay. Surprisingly, our scheduler produces adaptive step-size trajectories that, without tuning, mirror the manually engineered step-size decay schedules.
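
The scheduling idea in the abstract, keeping each layer's effective step-size below a per-layer bound, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `clip_layer_step_sizes`, the dictionary layout, and the assumption that a layer's effective step-size scales linearly with its learning rate are all hypothetical, and the bound itself is taken as a given input because its formula is not stated here.

```python
from typing import Dict


def clip_layer_step_sizes(
    base_lr: Dict[str, float],
    effective_step: Dict[str, float],
    safe_bound: Dict[str, float],
) -> Dict[str, float]:
    """Return per-layer learning rates rescaled to respect a per-layer bound.

    Assumption (not from the paper): the effective step-size of a layer is
    proportional to its learning rate, so shrinking the learning rate by
    bound / effective_step brings the effective step-size down to the bound.
    """
    new_lr: Dict[str, float] = {}
    for name, lr in base_lr.items():
        alpha = effective_step[name]  # measured effective step-size for this layer
        bound = safe_bound[name]      # hypothetical precomputed safe step-size
        if alpha > bound and alpha > 0.0:
            scale = bound / alpha     # shrink only when the bound is exceeded
        else:
            scale = 1.0               # leave layers within the bound untouched
        new_lr[name] = lr * scale
    return new_lr


# Hypothetical layer names and values, chosen only for illustration.
lrs = clip_layer_step_sizes(
    base_lr={"fc1": 1e-3, "fc2": 1e-3},
    effective_step={"fc1": 0.02, "fc2": 0.005},
    safe_bound={"fc1": 0.01, "fc2": 0.01},
)
```

In this sketch only `fc1` exceeds its bound, so only its learning rate is reduced; layers already within the bound keep their base rate, which is one way a per-layer rule can differ from a single global decay schedule.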

Paper Structure

This paper contains 41 sections, 33 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Single indicators are incomplete explanations for loss of trainability. In each panel, we compare two configurations (blue vs. orange). If an indicator were reliable, a collapse in the metric (dashed) should consistently predict a collapse in accuracy (solid). However, we observe contradictions; for example, in (a), the L2 model maintains high Hessian rank yet suffers catastrophic accuracy loss, whereas the Wasserstein model maintains accuracy despite similar rank behavior.
  • Figure 2: Sharpness volatility as an indicator. Under L2 regularization, the rise in curvature volatility (solid) consistently precedes and mirrors the collapse in task accuracy (dashed).
  • Figure 3: First layer step-size dynamics when training with L2 weight decay ($\lambda=10^{-3}$). The effective step-size $\alpha_t$ (blue) exhibits upward drift across tasks due to weight-norm growth, which Adam's preconditioning converts into an increased step-size.
  • Figure 4: Diagnosis and Mitigation. Prediction (top): The combined bound $\tilde{\alpha}^*$ (red) most accurately anticipates the onset of accuracy drops (blue) compared to individual metrics ($\alpha^*_{\mathrm{Vol}}$, green; $\alpha^*_g$, orange). Performance (bottom): The per-layer scheduler (blue) consistently restores trainability across all baselines, significantly outperforming vanilla (red) training and task resets (orange).
  • Figure 5: Scheduled step-sizes on the first 4 tasks.
  • ...and 7 more figures