Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model

Blake Bordelon, Francesco Mori

TL;DR

This paper addresses the challenge of choosing learning-rate schedules that transfer across training horizons in SGD by analyzing a solvable power-law random-feature model. Using optimal-control theory on reduced SGD dynamics, the authors derive horizon-dependent schedules and identify an easy phase ($b<a$) and a hard phase ($b>a$), with polynomial-decay schedules in the easy phase and warmup-stable-decay schedules in the hard phase. They establish compute- and data-budget scaling laws, extend the framework to joint optimization with batch size and momentum, and show that the optimal schedules outperform constant and naive power-law baselines. Experiments on CIFAR-5M with ResNets corroborate the horizon-transfer ideas and illustrate the practical benefits of annealing in deep networks, while acknowledging the limitations of the linear-model setting and the need for broader validation.

Abstract

Setting the learning rate for a deep learning model is a critical part of successful training, yet this hyperparameter is often chosen empirically by trial and error. In this work, we explore a solvable model of optimal learning rate schedules for a power-law random feature model trained with stochastic gradient descent (SGD). We consider the optimal schedule $\eta_T^\star(t)$, where $t$ is the current iterate and $T$ is the total training horizon. This schedule is computed both numerically and analytically (when possible) using optimal control methods. Our analysis reveals two regimes, which we term the easy phase and the hard phase. In the easy phase, the optimal schedule is a polynomial decay $\eta_T^\star(t) \simeq T^{-\xi} (1-t/T)^{\delta}$, where $\xi$ and $\delta$ depend on the properties of the features and task. In the hard phase, the optimal schedule resembles warmup-stable-decay with constant (in $T$) initial learning rate and annealing performed over a vanishing (in $T$) fraction of training steps. We investigate joint optimization of learning rate and batch size, identifying a degenerate optimality condition. Our model also predicts the compute-optimal scaling laws (where model size and training steps are chosen optimally) in both easy and hard regimes. Going beyond SGD, we consider optimal schedules for the momentum $\beta(t)$, where speedups in the hard phase are possible. We compare our optimal schedule to various benchmarks in our task, including (1) optimal constant learning rates $\eta_T(t) \sim T^{-\xi}$ and (2) optimal power laws $\eta_T(t) \sim T^{-\xi} t^{-\chi}$, finding that our schedule achieves better rates than either of these. Our theory suggests that learning rate transfer across training horizons depends on the structure of the model and task. We explore these ideas in simple experimental pretraining setups.
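
The two schedule shapes named in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the prefactor `eta0`, and the parameter `anneal_fraction` are hypothetical, the values of $\xi$ and $\delta$ must be supplied by the user (the abstract only says they depend on the features and task), and the linear shape of the final decay in the hard phase is an assumption, since the abstract specifies only a constant learning rate followed by annealing over a small fraction of steps. The warmup portion is omitted.

```python
def easy_phase_schedule(t, T, xi, delta, eta0=1.0):
    """Polynomial decay eta_T(t) = eta0 * T**(-xi) * (1 - t/T)**delta.

    xi and delta are task-dependent exponents supplied by the caller;
    eta0 is a hypothetical constant prefactor.
    """
    return eta0 * T ** (-xi) * (1.0 - t / T) ** delta


def hard_phase_schedule(t, T, eta_max, anneal_fraction):
    """Constant eta_max until t_s = (1 - anneal_fraction) * T, then decay to zero.

    The linear decay after t_s is an assumed shape chosen for illustration.
    """
    if not 0.0 < anneal_fraction <= 1.0:
        raise ValueError("anneal_fraction must lie in (0, 1]")
    t_s = (1.0 - anneal_fraction) * T
    if t < t_s:
        return eta_max
    return eta_max * (T - t) / (T - t_s)


if __name__ == "__main__":
    T = 100
    for t in (0, 50, 95):
        print(t, easy_phase_schedule(t, T, xi=0.5, delta=1.0),
              hard_phase_schedule(t, T, eta_max=1.0, anneal_fraction=0.1))
```

In the easy-phase sketch the whole curve shrinks with the horizon through the factor $T^{-\xi}$, whereas in the hard-phase sketch the peak value is independent of $T$ and only the annealing window changes.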

Paper Structure (29 sections, 94 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: SGD learning rates do not automatically transfer over training horizons $T$. This motivates theory that can identify not only how to scale $\eta$ with $T$, but also how to set the entire learning rate schedule $\eta(t)$ with $T$. (a) The loss of a deep ResNet trained on CIFAR-5M. (b) Test loss of a random feature model trained with SGD as a function of fixed learning rate $\eta$. The optimal learning rate shifts leftwards in this model, mimicking the behavior of the real network training (bjorck2024scaling).
  • Figure 2: Comparison of optimal learning rate schedules in the hard (top row, $b > a$) and easy (bottom row, $b < a$) phases. (a, d) Profile of the optimal learning rate $\eta_T^*(t)$. In the hard phase (a), the schedule maintains a constant maximum value for $t < t_s$ followed by a rapid annealing phase, where the annealing fraction $1 - t_s/T$ vanishes as $T \to \infty$ (see Fig. \ref{fig:ts}). In the easy phase (d), the schedule collapses onto the scaling form $\eta_T^*(t) \approx T^{b/a-1}f(t/T)$ (dashed theoretical curve, see Eq. \ref{eq:eta_star_easy}). (b, e) Evolution of the loss over training time $t$ for the optimal schedule. (c, f) Scaling of the final excess loss $L_T - \sigma^2$ with the training horizon $T$. The optimal schedule improves the scaling exponent compared to the optimal constant and power-law baselines. Exponents obtained from numerical fits match the theoretical predictions: for the constant and power-law cases, $\zeta_T=\frac{a-1}{a+b-1}$ ($\zeta_T=1/3$ for panel (a) and $\zeta_T=5/9\approx0.56$ for panel (d)); for the optimal schedule, $\zeta_T=\min(\frac{a-1}{a},\frac{a-1}{b})$ ($\zeta_T=0.5$ for panel (a) and $\zeta_T=5/7\approx0.71$ for panel (d)). Parameters: $N=1000$, $\eta_{\max}=1$, $a=3.5$, $m=5$, with $b=5$ (top row) and $b=2$ (bottom row).
  • Figure 3: Decomposition of the excess loss $L_t - \sigma^2$ into bias and variance components. (a) In the easy phase ($b=2$, $a=3.5$), the optimal schedule minimizes bias and variance simultaneously throughout the training trajectory. (b) In the hard phase ($b=5$, $a=3.5$), the schedule minimizes the bias for the majority of the training time ($t < t_s$) where the learning rate is large, while the final annealing phase ($t > t_s$) is responsible for suppressing the variance. Parameters: $T=3162$, $N=1000$, $\sigma=0.5$, $m=5$.
  • Figure 4: Compute optimal scaling. Residual loss $L_C-\sigma_0^2$ as a function of the compute $C=NT$ for different values of the model size $N$. The dashed lines indicate the theoretical prediction. Parameters: $\sigma=0.5$, $m=5$. In the easy phase $b=1.5$ and $a=2$; in the hard phase $b=2$ and $a=1.5$.
  • Figure 5: Optimal Schedule and loss dynamics for SGD + momentum. (a) For the easy task regime, the numerically optimized schedule achieves the same scaling law as SGD with optimal schedule $L_T - \sigma^2 \sim T^{-1+1/a}$. (b) The optimal momentum dynamics vary significantly across $T$ but only weakly vary with $t$. (c) The learning rate for optimal momentum schedules anneals similarly to SGD in the easy phase. (d) In the hard phase, the scaling law for the loss obtained by jointly optimizing momentum and learning rate is better than the SGD rate $T^{-(a-1)/b}$. (e)-(f) Near the end of training, the momentum variable increases and the learning rate rapidly decreases.
  • ...and 2 more figures
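
The exponents quoted in the Figure 2 caption follow directly from the two formulas given there, and the arithmetic can be checked with a short script. The sketch below uses exact rational arithmetic; the function names are hypothetical, and the formulas are taken from the caption as stated, without re-deriving them.

```python
from fractions import Fraction


def baseline_exponent(a, b):
    """zeta_T = (a - 1) / (a + b - 1), quoted for constant and power-law schedules."""
    return (a - 1) / (a + b - 1)


def optimal_exponent(a, b):
    """zeta_T = min((a - 1) / a, (a - 1) / b), quoted for the optimal schedule."""
    return min((a - 1) / a, (a - 1) / b)


if __name__ == "__main__":
    a = Fraction(7, 2)  # a = 3.5, as in the Figure 2 parameters
    for label, b in (("hard phase, b=5", Fraction(5)),
                     ("easy phase, b=2", Fraction(2))):
        print(label, baseline_exponent(a, b), optimal_exponent(a, b))
```

With $a=3.5$ this gives $1/3$ versus $1/2$ in the hard phase ($b=5$) and $5/9$ versus $5/7$ in the easy phase ($b=2$), so in both phases the optimal schedule has the larger exponent, matching the values in the caption.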