Optimal Learning-Rate Schedules under Functional Scaling Laws: Power Decay and Warmup-Stable-Decay

Binghui Li, Zilin Wang, Fengling Chen, Shiyang Zhao, Ruiheng Zheng, Lei Wu

TL;DR

The paper derives optimal learning-rate schedules (LRSs) under the functional scaling law (FSL) for a fixed training horizon $N$. By recasting SGD dynamics in intrinsic time, it obtains a variational problem whose solution is a power-decay schedule in the easy regime and a warmup-stable-decay (WSD) structure in the hard regime. In the easy regime $s \ge 1-1/\beta$, the schedule is $\eta^*(z)=\eta_{\mathrm{peak}}(1-z/N)^{2\beta-1}$ with $\eta_{\mathrm{peak}} \sim N^{-\frac{1+s\beta-\beta}{1+s\beta}}$, while in the hard regime $s<1-1/\beta$ the final-step loss scales as $\mathcal{E}_N^* \sim N^{-s}$. The analysis also introduces shape-fixed fractional LRSs, revealing a capacity-saturation phenomenon governed by $\alpha=\min\{\beta, \gamma+1\}$ and showing how common schedules such as cosine can saturate capacity in practical settings. Extending the framework to one-pass SGD for kernel regression, the authors prove that the last iterate under the power-decay LRS achieves the minimax-optimal rate, removing the logarithmic penalties of prior analyses. Overall, the work provides a unified, theoretically grounded lens for LRS design that bridges theory and practice across linear regression, kernel methods, and large-scale model training.

Abstract

We study optimal learning-rate schedules (LRSs) under the functional scaling law (FSL) framework introduced in Li et al. (2025), which accurately models the loss dynamics of both linear regression and large language model (LLM) pre-training. Within FSL, loss dynamics are governed by two exponents: a source exponent $s>0$ controlling the rate of signal learning, and a capacity exponent $\beta>1$ determining the rate of noise forgetting. Focusing on a fixed training horizon $N$, we derive the optimal LRSs and reveal a sharp phase transition. In the easy-task regime $s \ge 1 - 1/\beta$, the optimal schedule follows a power decay to zero, $\eta^*(z) = \eta_{\mathrm{peak}}(1 - z/N)^{2\beta - 1}$, where the peak learning rate scales as $\eta_{\mathrm{peak}} \eqsim N^{-\nu}$ for an explicit exponent $\nu = \nu(s,\beta)$. In contrast, in the hard-task regime $s < 1 - 1/\beta$, the optimal LRS exhibits a warmup-stable-decay (WSD) (Hu et al. (2024)) structure: it maintains the largest admissible learning rate for most of training and decays only near the end, with the decay phase occupying a vanishing fraction of the horizon. We further analyze optimal shape-fixed schedules, where only the peak learning rate is tuned -- a strategy widely adopted in practice -- and characterize their strengths and intrinsic limitations. This yields a principled evaluation of commonly used schedules such as cosine and linear decay. Finally, we apply the power-decay LRS to one-pass stochastic gradient descent (SGD) for kernel regression and show that the last iterate attains the exact minimax-optimal rate, eliminating the logarithmic suboptimality present in prior analyses. Numerical experiments corroborate our theoretical predictions.
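
The easy-regime schedule stated above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the exponent $\nu$ is taken from the scaling $\eta_{\mathrm{peak}} \sim N^{-\frac{1+s\beta-\beta}{1+s\beta}}$ given in the TL;DR, the function names are hypothetical, the prefactor `c` is an assumed tuning constant (the scaling law fixes the peak learning rate only up to constants), and the discrete step index $z = 0, \dots, N-1$ is an assumption about how the continuous schedule is sampled.

```python
def regime(s: float, beta: float) -> str:
    """Classify the task regime by the threshold s >= 1 - 1/beta."""
    return "easy" if s >= 1.0 - 1.0 / beta else "hard"


def peak_exponent(s: float, beta: float) -> float:
    """Exponent nu in eta_peak ~ N^(-nu), as stated in the TL;DR (easy regime only)."""
    if regime(s, beta) != "easy":
        raise ValueError("the peak scaling is only stated for the easy regime")
    return (1.0 + s * beta - beta) / (1.0 + s * beta)


def power_decay_schedule(N: int, s: float, beta: float, c: float = 1.0) -> list[float]:
    """Return eta*(z) = eta_peak * (1 - z/N)^(2*beta - 1) for z = 0, ..., N-1.

    The prefactor c is a hypothetical constant: the source gives eta_peak
    only up to constants, so c would have to be tuned in practice.
    """
    eta_peak = c * N ** (-peak_exponent(s, beta))
    return [eta_peak * (1.0 - z / N) ** (2.0 * beta - 1.0) for z in range(N)]


# Example with s = 1, beta = 2 (easy regime, since 1 >= 1 - 1/2).
schedule = power_decay_schedule(N=1000, s=1.0, beta=2.0)
```

For $s=1$, $\beta=2$ the exponent is $\nu = 1/3$, so with `c = 1` and $N=1000$ the schedule starts at $1000^{-1/3} \approx 0.1$ and decays to zero as $(1-z/N)^{3}$. The hard regime is deliberately left out, because the abstract describes the WSD structure only qualitatively.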

Paper Structure (56 sections, 19 theorems, 176 equations, 1 figure)

Key Result

Theorem 4.1

Let $t_*$ be a minimizer of \eqref{equ: functional_physical}, define $\eta^*(z)\coloneqq t_*'(z)$, and let $\mathcal{E}_N^*=\bar{\mathcal{F}}[t_*]$ be the final-step loss. Then the following holds.

Figures (1)

  • Figure 1: (left) Illustration of optimal learning-rate schedules (LRSs): power decay in the easy-task regime and WSD with power decay in the hard-task regime. (middle) Performance comparison of cosine ($\gamma=2$) and power-decay ($\gamma=4.2$) LRSs for feature-space linear regression with source exponent $s=0.8$ and capacity exponent $\beta=5$. Power decay achieves the minimax-optimal rate $N^{-\beta s/(\beta s+1)}$, whereas cosine decay suffers from capacity saturation and exhibits the suboptimal rate predicted by our theory (corresponding to the green region in the right phase diagram). For each data size, we perform $500$ independent runs of SGD, and tune the peak learning rate to minimize the average final-step loss. (right) Phase diagram of convergence rates under shape-fixed fractional LRSs (Theorem 5.3). Each region in the $(\beta,s)$ plane corresponds to distinct convergence rates. The vertical boundary $\beta=\gamma+1$ marks a capacity-saturation threshold induced by fixing the decay shape; in the green region, this restriction leads to suboptimal convergence rates.
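
The capacity-saturation threshold in the caption can be checked numerically. The sketch below is an illustration under stated assumptions, not code from the paper: it uses only the quantity $\alpha=\min\{\beta,\gamma+1\}$ from the TL;DR and the boundary $\beta=\gamma+1$ from the phase diagram, and the function names are hypothetical. It does not reproduce the convergence rates themselves, which are given by the paper's theorems on fractional LRSs.

```python
def effective_capacity(beta: float, gamma: float) -> float:
    """alpha = min(beta, gamma + 1) for a schedule with tail exponent gamma."""
    return min(beta, gamma + 1.0)


def crosses_saturation_threshold(beta: float, gamma: float) -> bool:
    """True when beta > gamma + 1, i.e. the fixed shape caps alpha below beta."""
    return gamma + 1.0 < beta


# The two schedules compared in the middle panel, with beta = 5.
beta = 5.0
for name, gamma in [("cosine", 2.0), ("power decay", 4.2)]:
    alpha = effective_capacity(beta, gamma)
    saturated = crosses_saturation_threshold(beta, gamma)
    print(f"{name}: gamma={gamma}, alpha={alpha}, beyond threshold={saturated}")
```

With $\beta=5$, cosine ($\gamma=2$) gives $\alpha=3<\beta$ and lies beyond the threshold, whereas power decay with $\gamma=4.2$ gives $\alpha=5=\beta$ and does not, which matches the comparison described in the caption.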

Theorems & Definitions (41)

  • Remark 3.3
  • Theorem 4.1: Optimal learning-rate schedules
  • Definition 5.1: Fractional LRS with power-decay tail
  • Theorem 5.2: Scaling law for fractional LRS
  • Theorem 5.3: Optimal fractional LRS
  • Remark 5.4
  • Theorem 6.3: Convergence rate of SGD with power decay
  • Proposition 6.4
  • Remark A.1
  • Theorem A.2: KKT condition
  • ...and 31 more