A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules

Kairong Luo, Haodong Wen, Shengding Hu, Zhenbo Sun, Zhiyuan Liu, Maosong Sun, Kaifeng Lyu, Wenguang Chen

TL;DR

The paper introduces the Multi-Power Law (MPL), a schedule-aware empirical law that predicts the pretraining loss curve across learning-rate (LR) schedules by combining a base power law in the LR sum with a loss-reduction term LD that captures the effect of LR decay. MPL is derived via a bottom-up, LR-sum-matching approach and validated across model sizes and architectures, where it generalizes well to unseen and longer-horizon schedules. The authors show that fitting MPL on a small set of schedules yields accurate predictions of entire loss trajectories and supports optimizing LR schedules that outperform cosine and tuned Warmup-Stable-Decay (WSD) schedules, with downstream gains. A theoretical analysis under quadratic loss and spectral assumptions links MPL to power-law structures in the Hessian and gradient noise, and extensive ablations confirm robustness across architectures, sizes, and hyperparameters. Overall, MPL offers a practical, data-efficient tool for understanding and designing LR schedules to improve training efficiency in large-language-model pretraining.
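As a rough illustration of "a base power law in the LR sum plus a loss-reduction term", the sketch below evaluates one plausible multi-power form. The exact functional form, the parameter names (`L0`, `A`, `alpha`, `B`, `C`, `beta`, `gamma`, `S_W`), and all numeric values are assumptions made for illustration; they are not stated in this summary and should be checked against the paper's equations before use.

```python
import numpy as np

def multi_power_law(lrs, L0, A, alpha, B, C, beta, gamma, S_W=0.0):
    """Sketch of a multi-power loss prediction (assumed form, not authoritative).

    lrs[0] is the LR at the end of warmup; lrs[t] is the LR used at step t >= 1.
    All LRs must be strictly positive. Returns predicted losses for t = 1..T.
    """
    lrs = np.asarray(lrs, dtype=float)
    cum = np.cumsum(lrs)                      # cum[t] = lrs[0] + ... + lrs[t]
    preds = np.empty(len(lrs) - 1)
    for t in range(1, len(lrs)):
        s1 = cum[t] - cum[0]                  # LR sum over steps 1..t
        k = np.arange(1, t + 1)
        s_k = cum[t] - cum[k - 1]             # LR sum over steps k..t
        drop = lrs[k - 1] - lrs[k]            # LR reduction at step k
        # Each LR reduction contributes a saturating power-law loss drop.
        g = 1.0 - (C * lrs[k] ** (-gamma) * s_k + 1.0) ** (-beta)
        ld = B * np.sum(drop * g)             # total loss reduction LD(t)
        preds[t - 1] = L0 + A * (s1 + S_W) ** (-alpha) - ld
    return preds

# Hypothetical parameters, chosen only to exercise the function.
params = dict(L0=2.0, A=0.5, alpha=0.5, B=300.0, C=1.0, beta=0.5, gamma=0.5)
two_stage = [3e-4] * 4 + [9e-5] * 4           # constant stage, then a lower LR
print(multi_power_law(two_stage, **params))
```

Under a constant schedule every LR reduction is zero, so the sketch reduces to the base power law in the LR sum; the correction term only becomes active once the LR decays.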

Abstract

Training large models is both resource-intensive and time-consuming, making it crucial to understand the quantitative relationship between model performance and hyperparameters. In this paper, we present an empirical law that describes how the pretraining loss of large language models evolves under different learning rate schedules, such as constant, cosine, and step decay schedules. Our proposed law takes a multi-power form, combining a power law based on the sum of learning rates and additional power laws to account for a loss reduction effect induced by learning rate decay. We extensively validate this law on various model sizes and architectures, and demonstrate that after fitting on a few learning rate schedules, the law accurately predicts the loss curves for unseen schedules of different shapes and horizons. Moreover, by minimizing the predicted final pretraining loss across learning rate schedules, we are able to find a schedule that outperforms the widely used cosine learning rate schedule. Interestingly, this automatically discovered schedule bears some resemblance to the recently proposed Warmup-Stable-Decay (WSD) schedule (Hu et al., 2024) but achieves a slightly lower final loss. We believe these results could offer valuable insights for understanding the dynamics of pretraining and designing learning rate schedules to improve efficiency.
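The schedule families named above can be written as small post-warmup functions of the step index. This is a minimal sketch for orientation only: the function names, the `final_ratio` parameter, and the choice of exponential decay for WSD are illustrative assumptions, not the authors' training configuration.

```python
import math

def constant_lr(step, total_steps, peak_lr):
    """Constant schedule: the LR never changes after warmup."""
    return peak_lr

def cosine_lr(step, total_steps, peak_lr, final_ratio=0.1):
    """Cosine schedule decaying from peak_lr to final_ratio * peak_lr."""
    min_lr = final_ratio * peak_lr
    progress = step / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def step_decay_lr(step, total_steps, peak_lr, switch_step, low_lr):
    """Two-stage step decay: peak_lr up to switch_step, then low_lr."""
    return peak_lr if step <= switch_step else low_lr

def wsd_lr(step, total_steps, peak_lr, decay_start, final_ratio=0.1):
    """WSD after warmup: hold peak_lr, then decay exponentially to final_ratio * peak_lr."""
    if step <= decay_start:
        return peak_lr
    frac = (step - decay_start) / (total_steps - decay_start)
    return peak_lr * final_ratio ** frac

# Hypothetical horizon and peak LR, used only to tabulate the schedules.
T, peak = 1000, 3e-4
cosine = [cosine_lr(s, T, peak) for s in range(T + 1)]
wsd = [wsd_lr(s, T, peak, decay_start=800) for s in range(T + 1)]
```

A list of per-step LRs such as `cosine` or `wsd` is the kind of input a schedule-aware loss law consumes, since the law depends on the whole schedule rather than on the final LR alone.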

Paper Structure

This paper contains 87 sections, 11 theorems, 98 equations, 21 figures, 7 tables.

Key Result

Theorem 1

Under the power-spectra assumption (asp:power-spectra), for all $1 \le t \le T$, if $\eta_{\max} := \max_{0 \le t \le T} \{ \eta_t \}$ is sufficiently small and $S_1(t)$ is sufficiently large, then we have the following estimate of $\mathbb{E}[\mathcal{L}({\bm{\theta}}_t)]$:

Figures (21)

  • Figure 1: Optimizing the LR schedule yields a schedule (Opt) that outperforms cosine and WSD schedules. We evaluate on a $400$M Llama-2 (Touvron et al., 2023) model trained over $12$B tokens. Zoomed-in regions show local details. (a) After warmup, our optimized schedule comprises a constant stage and a decay stage, resembling WSD (Hu et al., 2024). (b) Loss curves show that our optimized schedule outperforms cosine schedules and two major WSD variants with tuned hyperparameters (WSD with exponential decay and WSDLD with linear decay).
  • Figure 2: The Multi-Power Law (MPL) with parameters fitted on cosine, constant, and two-stage schedules can accurately predict the loss curves of unseen schedules, including WSDLD, WSD, and two-stage schedules with a different LR in the second stage. See \ref{tab:comp} for evaluation metrics.
  • Figure 3: An example multi-stage schedule (\ref{sec:multi-stage}) illustrating learning rate (LR) sum matching (\ref{sec:approach}) and the fine-grained loss reduction decomposition (\ref{sec:fine-grained}). The steps whose LR sum equals that of the final step $T_9=8720$ are marked and linked with the dash-dotted line. Each stage spans 90 steps. $T_1=8000$, $T_2=8090$, $t^{(1)}=Z_{T_2}(T_9)$, $t^{(2)}=Z_{T_3}(T_9)$. See \ref{app:mulit-stage} for experiment details. Left: The actual multi-stage schedule and the schedules of the auxiliary processes. The LR gap between adjacent points denotes the LR reduction $\Delta \eta^{(i)}=\eta^{(i-1)}-\eta^{(i)}$. Right: Corresponding training curves for the multi-stage schedule and the auxiliary processes. The total loss reduction is $\mathrm{LD}(T_9)$, which decomposes into the sum of the intermediate loss reductions. The loss gap between adjacent points denotes the stage-wise loss reduction $\mathrm{LD}^{(i)}(t^{(i)})$.
  • Figure 4: The loss reduction (LD) of a two-stage schedule exhibits a power law. Example setting: $t_B=11000$, $x_B=3000$, $\eta_{\mathrm{B}}=9\times 10^{-5}$, $\eta_{\mathrm{A}}=3\times 10^{-4}$, $T_{\mathrm{A}}=8000$. (a) A and B have equal LR sums: $x_A=900$, $t_A=8900$. (b) Loss reduction at $B$: $\mathrm{LD}(T_A+x_B)=\mathcal{L}_A(t_A)-\mathcal{L}_B(t_B)$. (c) Fitting the loss reduction $\widehat{\mathrm{LD}}(T_{\mathrm{A}}+x_B)$ with a power form gives $0.13(1 -(1 + 0.21x)^{-0.15})$; fitting with an exponential form gives $0.0790(1-e^{-0.01x})$. The shape of the loss reduction is closer to the power form than to the exponential form.
  • Figure 5: The dependency of $\tilde{B}$ and $\tilde{C}$ on $\eta_A$, $\eta_B$, and $T_A$ in the two-stage cases. $\tilde{B}$ is approximately proportional to $\eta_A -\eta_B$, and $\tilde{C}$ follows a power-law pattern in $\eta_B$. The dependency of $\tilde{C}$ on $\eta_A$ and the impact of $T_A$ on $\tilde{B}$ and $\tilde{C}$ are either unpredictable or negligible, so we ignore them in our discussion.
  • ...and 16 more figures
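The LR-sum matching in the Figure 4 caption can be checked with a few lines of arithmetic. The sketch assumes, as the caption suggests, that process A keeps the constant LR $\eta_A$ after step $T_A$ while process B switches to $\eta_B$ at $T_A$; the helper name `match_lr_sum` is hypothetical.

```python
def match_lr_sum(eta_a, eta_b, t_switch, x_b):
    """Find the step of constant-LR process A whose LR sum equals that of B.

    B runs x_b steps at eta_b after t_switch; A keeps running at eta_a.
    Both share the same LR sum up to t_switch, so only the tails must match:
        x_a * eta_a == x_b * eta_b
    Returns (x_a, t_a) with t_a = t_switch + x_a.
    """
    x_a = x_b * eta_b / eta_a
    return x_a, t_switch + x_a

# Values from the Figure 4 caption.
x_a, t_a = match_lr_sum(eta_a=3e-4, eta_b=9e-5, t_switch=8000, x_b=3000)
print(round(x_a), round(t_a))  # 900 8900
```

The loss reduction at B is then the gap between A's loss at the matched step $t_A$ and B's loss at $t_B = T_A + x_B = 11000$, which is the quantity fitted in panel (c).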

Theorems & Definitions (22)

  • Theorem 1
  • Theorem 2
  • Lemma 1
  • proof
  • proof: Proof for \ref{thm;opt-lr-schedule}
  • Theorem 3
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • ...and 12 more