Optimal Learning-Rate Schedules under Functional Scaling Laws: Power Decay and Warmup-Stable-Decay
Binghui Li, Zilin Wang, Fengling Chen, Shiyang Zhao, Ruiheng Zheng, Lei Wu
TL;DR
The paper studies how to design optimal learning-rate schedules (LRSs) under the functional scaling law (FSL) for a fixed training horizon $N$. By recasting SGD dynamics in intrinsic time, it derives a variational problem whose solution is a power-decay schedule in the easy regime and a warmup-stable-decay (WSD) structure in the hard regime. In the easy regime $s \ge 1-1/\beta$, the optimal schedule is $\eta^*(z)=\eta_{\mathrm{peak}}(1-z/N)^{2\beta-1}$ with $\eta_{\mathrm{peak}} \sim N^{-\frac{1+s\beta-\beta}{1+s\beta}}$, while in the hard regime $s<1-1/\beta$ the optimal loss scales as $E_N^* \sim N^{-s}$. The analysis also introduces shape-fixed fractional LRSs, revealing a capacity-saturation phenomenon governed by $\alpha=\min\{\beta, \gamma+1\}$ and showing how common schedules such as cosine can saturate capacity in practical settings. Extending the framework to one-pass SGD for kernel regression, the authors prove that the power-decay LRS achieves the minimax-optimal rate on the last iterate, removing logarithmic penalties. Overall, the work provides a unified, theoretically grounded lens for LRS design that bridges theory and practice across linear regression, kernel methods, and large-scale model training.
Abstract
We study optimal learning-rate schedules (LRSs) under the functional scaling law (FSL) framework introduced in Li et al. (2025), which accurately models the loss dynamics of both linear regression and large language model (LLM) pre-training. Within FSL, loss dynamics are governed by two exponents: a source exponent $s>0$ controlling the rate of signal learning, and a capacity exponent $\beta>1$ determining the rate of noise forgetting. Focusing on a fixed training horizon $N$, we derive the optimal LRSs and reveal a sharp phase transition. In the easy-task regime $s \ge 1 - 1/\beta$, the optimal schedule follows a power decay to zero, $\eta^*(z) = \eta_{\mathrm{peak}}(1 - z/N)^{2\beta - 1}$, where the peak learning rate scales as $\eta_{\mathrm{peak}} \eqsim N^{-\nu}$ for an explicit exponent $\nu = \nu(s,\beta)$. In contrast, in the hard-task regime $s < 1 - 1/\beta$, the optimal LRS exhibits a warmup-stable-decay (WSD; Hu et al., 2024) structure: it maintains the largest admissible learning rate for most of training and decays only near the end, with the decay phase occupying a vanishing fraction of the horizon. We further analyze optimal shape-fixed schedules, where only the peak learning rate is tuned (a strategy widely adopted in practice), and characterize their strengths and intrinsic limitations. This yields a principled evaluation of commonly used schedules such as cosine and linear decay. Finally, we apply the power-decay LRS to one-pass stochastic gradient descent (SGD) for kernel regression and show that the last iterate attains the exact minimax-optimal rate, eliminating the logarithmic suboptimality present in prior analyses. Numerical experiments corroborate our theoretical predictions.
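
As a rough illustration of the two schedule shapes described in the abstract, the following Python sketch evaluates the power-decay formula $\eta_{\mathrm{peak}}(1 - z/N)^{2\beta - 1}$ and a simple stable-then-decay schedule. The function names, the linear form of the final decay, and the `decay_frac` parameter are assumptions made for illustration and are not taken from the paper; the peak learning rate is treated as a user-supplied input, and any warmup phase is omitted.

```python
def is_easy_regime(s, beta):
    """Regime test from the abstract: easy-task regime iff s >= 1 - 1/beta."""
    return s >= 1.0 - 1.0 / beta


def power_decay_lr(z, N, beta, eta_peak):
    """Power-decay schedule eta_peak * (1 - z/N)^(2*beta - 1) for 0 <= z <= N."""
    frac = min(max(1.0 - z / N, 0.0), 1.0)  # remaining fraction of the horizon
    return eta_peak * frac ** (2.0 * beta - 1.0)


def stable_decay_lr(z, N, eta_max, decay_frac=0.1):
    """Hypothetical stable-then-decay shape (warmup omitted).

    Holds eta_max until the last decay_frac * N of the horizon, then decays
    linearly to zero. The linear decay and decay_frac are illustrative
    assumptions, not the paper's optimal decay profile.
    """
    decay_len = decay_frac * N
    tail = min(max((N - z) / decay_len, 0.0), 1.0)  # 1 while stable, then 1 -> 0
    return eta_max * tail


if __name__ == "__main__":
    N, beta, s = 1000, 1.5, 0.5
    schedule = power_decay_lr if is_easy_regime(s, beta) else stable_decay_lr
    for z in (0, 500, 900, 1000):
        print(z, schedule(z, N, beta, 0.1) if schedule is power_decay_lr
              else schedule(z, N, 0.1))
```

With $\beta = 1.5$ the power-decay exponent is $2\beta - 1 = 2$, so the learning rate at the midpoint of training is a quarter of its peak, whereas the stable-then-decay shape stays at its maximum until the final stretch of the horizon.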
