Leap+Verify: Regime-Adaptive Speculative Weight Prediction for Accelerating Neural Network Training

Jeremy McEntire

TL;DR

Leap+Verify aims to accelerate neural network training by exploiting predictability in weight trajectories: it speculatively predicts future weights and accepts a prediction only after a verification step. The method detects training regimes via activation-space cosine similarity, uses three analytic predictors (momentum, linear, quadratic) to forecast weights $K$ steps ahead, and validates each prediction against a held-out loss before fast-forwarding. Momentum-based extrapolation fails catastrophically at both model scales, while linear and quadratic finite-difference predictors succeed in transition and stable regimes. A key finding is that the larger model spends more of training in the chaotic regime, shifting the bottleneck from predictor accuracy to regime availability; cross-seed results remain highly consistent. The work draws on the Automatically Scalable Computation (ASC) architecture and on speculative decoding, and offers a regime-conditioned framework evaluated across five random seeds.

Abstract

We introduce Leap+Verify, a framework that applies speculative execution -- predicting future model weights and validating predictions before acceptance -- to accelerate neural network training. Inspired by speculative decoding in language model inference and by the Automatically Scalable Computation (ASC) architecture for program execution, Leap+Verify decomposes training into three dynamically detected regimes (chaotic, transition, stable) using activation-space cosine similarity as a real-time Lyapunov proxy signal. Within each regime, analytic weight predictors (momentum, linear, quadratic extrapolation) attempt to forecast model parameters K training steps ahead; predictions are accepted only when validated against a held-out loss criterion. We evaluate Leap+Verify on GPT-2 124M and Qwen 2.5-1.5B trained on WikiText-103 across five random seeds, sweeping prediction depth K in {5, 10, 25, 50, 75, 100}. Momentum-based prediction (Adam moment extrapolation) fails catastrophically at both scales, with predicted losses exceeding actuals by 100-10,000x -- a universal norm explosion in optimizer-state extrapolation. Finite-difference predictors (linear, quadratic) succeed where momentum fails: at 124M, they achieve 24% strict acceptance at K=5 in stable regimes; at 1.5B, they achieve 37% strict acceptance in transition regimes. The key scale-dependent finding concerns regime distribution: GPT-2 124M spends 34% of training in the stable regime, while Qwen 2.5-1.5B spends 64% in the chaotic regime and reaches the stable regime in only 0-2 of 40 checkpoints. Larger models are more predictable when predictable, but less often predictable -- the practical bottleneck shifts from predictor accuracy to regime availability. Cross-seed results are highly consistent (less than 1% validation loss variance), and the three-regime framework produces identical phase boundaries (plus or minus 50 steps) across seeds.
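To make the predict-then-verify loop concrete, the following is a minimal sketch under stated assumptions, not the paper's implementation. Weights are treated as a flat list of floats, snapshots are assumed to be equally spaced, and the quadratic predictor uses Newton's backward-difference formula as one plausible reading of "quadratic extrapolation". The regime thresholds (`chaotic_below`, `stable_above`), the `tolerance` parameter, the acceptance rule (held-out loss at the predicted weights no worse than at the current weights), and all function names are hypothetical choices made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Weights = List[float]  # flattened parameter vector (illustrative simplification)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two activation vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def classify_regime(similarity: float,
                    chaotic_below: float = 0.90,
                    stable_above: float = 0.99) -> str:
    """Map a similarity score to a regime. Thresholds are assumed, not from the paper."""
    if similarity < chaotic_below:
        return "chaotic"
    if similarity >= stable_above:
        return "stable"
    return "transition"


def linear_leap(w_prev: Weights, w_curr: Weights, k: int) -> Weights:
    """First-order finite-difference extrapolation k snapshot intervals ahead."""
    return [c + k * (c - p) for p, c in zip(w_prev, w_curr)]


def quadratic_leap(w_prev2: Weights, w_prev: Weights, w_curr: Weights, k: int) -> Weights:
    """Second-order extrapolation via Newton's backward-difference formula."""
    return [
        c + k * (c - p) + 0.5 * k * (k + 1) * (c - 2.0 * p + p2)
        for p2, p, c in zip(w_prev2, w_prev, w_curr)
    ]


def leap_and_verify(history: Sequence[Weights],
                    k: int,
                    heldout_loss: Callable[[Weights], float],
                    tolerance: float = 0.0) -> Tuple[Weights, bool]:
    """Propose a k-step leap; keep it only if the held-out loss does not get worse."""
    w_curr = history[-1]
    if len(history) >= 3:
        candidate = quadratic_leap(history[-3], history[-2], w_curr, k)
    else:
        candidate = linear_leap(history[-2], w_curr, k)
    accepted = heldout_loss(candidate) <= heldout_loss(w_curr) + tolerance
    return (candidate if accepted else w_curr), accepted


if __name__ == "__main__":
    # Toy held-out loss: squared distance to a target at 1.0 in every coordinate.
    loss = lambda w: sum((x - 1.0) ** 2 for x in w)
    snapshots = [[0.0, 0.0], [0.1, 0.1]]
    print(leap_and_verify(snapshots, 5, loss))    # short leap moves toward the target
    print(leap_and_verify(snapshots, 100, loss))  # long leap overshoots and is rejected
```

In this toy setting a short leap lowers the held-out loss and is accepted, while a long leap overshoots the target and is rejected, so training would simply continue from the current weights. A real training loop would apply the same arithmetic per parameter tensor and would attempt a leap only when the detected regime permits it.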
