Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model
Blake Bordelon, Francesco Mori
TL;DR
This paper addresses the challenge of choosing learning-rate schedules that transfer across training horizons in SGD by analyzing a solvable power-law random-feature model. Applying optimal-control theory to a reduced SGD dynamics, the authors derive horizon-dependent schedules and identify an easy phase ($b<a$) and a hard phase ($b>a$): polynomial-decay schedules are optimal in the easy phase, and warmup-stable-decay schedules in the hard phase. They establish compute- and data-budget scaling laws, extend the framework to joint optimization with batch size and momentum, and show that their optimal schedules outperform constant and naive power-law baselines. Experiments on CIFAR-5M with ResNets corroborate the horizon-transfer ideas and illustrate the practical benefits of annealing in deep networks. The authors acknowledge the limitations of the linear-model setting and the need for broader validation.
Abstract
Setting the learning rate for a deep learning model is a critical part of successful training, yet this hyperparameter is often chosen empirically by trial and error. In this work, we explore a solvable model of optimal learning rate schedules for a power-law random feature model trained with stochastic gradient descent (SGD). We consider the optimal schedule $η_T^\star(t)$, where $t$ is the current iterate and $T$ is the total training horizon. This schedule is computed both numerically and analytically (when possible) using optimal control methods. Our analysis reveals two regimes, which we term the easy phase and the hard phase. In the easy phase, the optimal schedule is a polynomial decay $η_T^\star(t) \simeq T^{-ξ} (1-t/T)^δ$, where $ξ$ and $δ$ depend on the properties of the features and task. In the hard phase, the optimal schedule resembles warmup-stable-decay, with a constant (in $T$) initial learning rate and annealing performed over a vanishing (in $T$) fraction of training steps. We investigate joint optimization of learning rate and batch size, identifying a degenerate optimality condition. Our model also predicts the compute-optimal scaling laws (where model size and training steps are chosen optimally) in both easy and hard regimes. Going beyond SGD, we consider optimal schedules for the momentum $β(t)$, where speedups in the hard phase are possible. We compare our optimal schedule to various benchmarks in our task, including (1) optimal constant learning rates $η_T(t) \sim T^{-ξ}$ and (2) optimal power laws $η_T(t) \sim T^{-ξ} t^{-χ}$, finding that our schedule achieves better rates than either of these. Our theory suggests that learning rate transfer across training horizons depends on the structure of the model and task. We explore these ideas in simple experimental pretraining setups.
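
To make the two schedule shapes described above concrete, the following is a minimal Python sketch, not the authors' implementation. The easy-phase function evaluates the stated form $T^{-ξ} (1-t/T)^δ$ directly. The hard-phase function is only an illustration of "constant in $T$, then annealed over a vanishing fraction of steps": the linear anneal, the omission of warmup, the decay-window size $T^{κ}$ with $κ<1$, and all default parameter values (`xi`, `delta`, `eta0`, `eta_peak`, `kappa`) are assumptions introduced here, since the abstract does not specify them.

```python
import numpy as np


def easy_phase_schedule(T, xi=0.25, delta=1.0, eta0=1.0):
    """Polynomial decay eta0 * T^(-xi) * (1 - t/T)^delta for t = 0, ..., T-1.

    xi, delta and eta0 are placeholder values; in the paper they depend on
    the feature spectrum and the task.
    """
    t = np.arange(T)
    return eta0 * float(T) ** (-xi) * (1.0 - t / T) ** delta


def hard_phase_like_schedule(T, eta_peak=0.1, kappa=0.5):
    """Stable-then-decay sketch: constant eta_peak, then a linear anneal.

    The anneal covers the last ceil(T^kappa) steps (assumed form), so the
    annealed fraction T^(kappa - 1) shrinks as T grows when kappa < 1.
    """
    t = np.arange(T)
    n_decay = max(1, int(np.ceil(float(T) ** kappa)))
    start = T - n_decay
    # Progress through the anneal window: 0 before it starts, below 1 at T-1.
    progress = np.clip((t - start) / n_decay, 0.0, 1.0)
    return eta_peak * (1.0 - progress)


if __name__ == "__main__":
    for T in (100, 10_000):
        easy = easy_phase_schedule(T)
        hard = hard_phase_like_schedule(T)
        annealed_fraction = np.mean(hard < hard[0])
        print(f"T={T}: easy eta(0)={easy[0]:.4f}, "
              f"hard eta(0)={hard[0]:.4f}, "
              f"annealed fraction={annealed_fraction:.4f}")
```

In this sketch the easy-phase initial learning rate shrinks with the horizon through the $T^{-ξ}$ prefactor, whereas the hard-phase-like schedule keeps the same initial value for every $T$ and only the relative length of the anneal changes.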
