Beta-Scheduling: Momentum from Critical Damping as a Diagnostic and Correction Tool for Neural Network Training

Ivan Pasichnyk

Abstract

Standard neural network training uses constant momentum (typically 0.9), a convention dating to 1964 with limited theoretical justification for its optimality. We derive a time-varying momentum schedule from the critically damped harmonic oscillator: $\mu(t) = 1 - 2\sqrt{\alpha(t)}$, where $\alpha(t)$ is the current learning rate. This beta-schedule requires zero free parameters beyond the existing learning rate schedule. On ResNet-18/CIFAR-10, beta-scheduling converges 1.9x faster to 90% accuracy than constant momentum. More importantly, per-layer gradient attribution under this schedule yields a cross-optimizer-invariant diagnostic: the same three problem layers are identified whether the model was trained with SGD or Adam (100% overlap). Surgical correction of only these layers fixes 62 misclassifications while retraining only 18% of parameters. A hybrid schedule -- physics momentum for fast early convergence, then constant momentum for final refinement -- reaches 95% accuracy fastest among the five methods tested. The main contribution is not an accuracy improvement but a principled, parameter-free tool for localizing and correcting specific failure modes in trained networks.
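
To make the schedule concrete, here is a minimal sketch of beta-scheduling on top of a standard PyTorch SGD loop. This is an illustration under assumptions, not the paper's code: the clamp range $[0.5, 0.99]$ is taken from the Figure 1 caption, and names such as critical_momentum are hypothetical.

```python
import math

import torch
from torch import nn

def critical_momentum(lr: float, lo: float = 0.5, hi: float = 0.99) -> float:
    # Critical damping gives mu_c = 1 - 2*sqrt(alpha); clamp to [0.5, 0.99]
    # (range assumed from the Figure 1 caption).
    return max(lo, min(hi, 1.0 - 2.0 * math.sqrt(lr)))

model = nn.Linear(10, 2)  # stand-in for ResNet-18
opt = torch.optim.SGD(model.parameters(), lr=0.1,
                      momentum=critical_momentum(0.1))
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=200)

for epoch in range(200):
    # ... forward/backward/opt.step() over the training batches ...
    sched.step()
    for group in opt.param_groups:
        # Re-derive momentum from the current learning rate: no free
        # parameters beyond the existing learning rate schedule.
        group["momentum"] = critical_momentum(group["lr"])
```

As the learning rate anneals, $\mu_c = 1 - 2\sqrt{\alpha}$ rises toward 1, so momentum grows late in training instead of sitting at 0.9; the hybrid variant described above would simply stop updating group["momentum"] at some switch point and hold a constant value for the final refinement phase.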

Figures

  • Figure 1: Left: cosine annealing learning rate schedule. Right: momentum trajectories for the three methods. The gray dashed line shows the unclamped critical damping curve $\mu_c = 1 - 2\sqrt{\alpha}$. Shaded regions indicate underdamped (red, above curve) and overdamped (blue, below curve) zones. The physics method (green) tracks the critical curve, clamped to $[0.5, 0.99]$.
  • Figure 2: Test accuracy during training. The physics method (green) converges fastest to intermediate thresholds, reaching 90% at epoch 52 versus epoch 100 (baseline) and epoch 144 (1cycle). All methods converge to comparable final accuracy.
  • Figure 3: Damping regime classification across 200 epochs. Red = underdamped ($\mu > \mu_c + 0.05$), green = critically damped ($|\mu - \mu_c| \leq 0.05$), blue = overdamped ($\mu < \mu_c - 0.05$). The baseline is underdamped for 85% of training; physics maintains near-critical damping throughout (a minimal classification sketch follows this list).
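
Following the thresholds quoted in the Figure 3 caption, the regime classification reduces to a comparison against the critical curve. A minimal sketch (the $\pm 0.05$ band comes from the caption; the function name and printed values are illustrative):

```python
import math

def damping_regime(mu: float, lr: float, band: float = 0.05) -> str:
    # Compare momentum against the critical curve mu_c = 1 - 2*sqrt(alpha).
    mu_c = 1.0 - 2.0 * math.sqrt(lr)
    if mu > mu_c + band:
        return "underdamped"   # red in Figure 3: momentum too high
    if mu < mu_c - band:
        return "overdamped"    # blue in Figure 3: momentum too low
    return "critical"          # green in Figure 3

# Constant momentum 0.9 drifts through regimes as the learning rate decays:
for lr in (0.1, 0.01, 0.001):
    print(f"lr={lr}: {damping_regime(0.9, lr)}")
# lr=0.1:   underdamped (mu_c ~ 0.37)
# lr=0.01:  underdamped (mu_c ~ 0.80)
# lr=0.001: critical    (mu_c ~ 0.94)
```

This is the sense in which the baseline is underdamped for most of training: a fixed $\mu = 0.9$ only crosses into the critical band once the learning rate has decayed far enough.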