Scaling Law with Learning Rate Annealing

Howe Tissue; Venus Wang; Lu Wang

TL;DR

This work promises to improve the understanding of LLM training dynamics while making scaling laws far cheaper to fit, and it can guide researchers in refining training strategies (e.g. choosing a critical LRS) for future LLMs.

Abstract

We find that the cross-entropy loss curves of neural language models empirically adhere to a scaling law with learning rate (LR) annealing over training steps: $$L(s) = L_0 + A\cdot S_1^{-\alpha} - C\cdot S_2,$$ where $L(s)$ is the validation loss at step $s$, $S_1$ is the area under the LR curve, $S_2$ is the LR annealing area, and $L_0$, $A$, $C$, and $\alpha$ are constant parameters. This formulation takes into account two factors: (1) power-law scaling over data size, and (2) the additional loss reduction during LR annealing. Therefore, this formulation can describe the full loss curve at each step, rather than only the single loss point at the end of training. By applying the scaling law with LR annealing and fitting only one or two training curves, we can accurately predict the loss at any given step under any learning rate scheduler (LRS). This approach significantly reduces the computational cost of formulating scaling laws while describing training dynamics more accurately and expressively. Extensive experiments demonstrate that our findings hold across a range of hyper-parameters and model architectures, and that our equation can be extended to capture the scaling effect of model size. Moreover, our formulation provides accurate theoretical verification and explanation for empirical results observed in numerous previous studies, particularly those focusing on LR schedules and annealing. We believe that this work promises to improve the understanding of LLM training dynamics while making scaling laws far cheaper to fit, and that it can guide researchers in refining training strategies (e.g. choosing a critical LRS) for future LLMs.
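To make the roles of $S_1$ and $S_2$ concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that $S_1$ is the running sum of per-step learning rates and that $S_2$ accumulates LR decreases through a momentum term with a decay factor; the decay value `0.999`, the warmup-free cosine schedule, and all function names are assumptions made for illustration. The coefficients in the demo are the fitted values quoted in the Figure 2 caption, but since the units and decay used for that fit are not stated here, the printed number is illustrative only.

```python
import math


def cosine_lr(step, total_steps, eta_max, eta_min=0.0):
    """Cosine LR schedule without warmup (an illustrative simplification)."""
    progress = step / total_steps
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * progress))


def areas(lrs, decay=0.999):
    """Return running S1 and S2 curves for a list of learning rates.

    lrs[0] is the LR before the first step; lrs[i] is the LR at step i.
    S1: cumulative sum of LRs (area under the LR curve).
    S2: cumulative sum of a momentum over LR decreases (assumed form).
    """
    s1, s2, momentum = 0.0, 0.0, 0.0
    s1_curve, s2_curve = [], []
    for prev, cur in zip(lrs, lrs[1:]):
        s1 += cur
        # Each LR drop (prev - cur) keeps contributing, discounted by `decay`.
        momentum = decay * momentum + (prev - cur)
        s2 += momentum
        s1_curve.append(s1)
        s2_curve.append(s2)
    return s1_curve, s2_curve


def predicted_loss(s1, s2, L0, A, alpha, C):
    """Evaluate L(s) = L0 + A * S1^(-alpha) - C * S2."""
    return L0 + A * s1 ** (-alpha) - C * s2


if __name__ == "__main__":
    total_steps = 20_000
    lrs = [cosine_lr(i, total_steps, eta_max=2e-4) for i in range(total_steps + 1)]
    s1_curve, s2_curve = areas(lrs)
    # Coefficients quoted in the Figure 2 caption; output is illustrative only.
    final = predicted_loss(s1_curve[-1], s2_curve[-1],
                           L0=2.628, A=0.429, alpha=0.550, C=0.411)
    print(f"S1={s1_curve[-1]:.4f}  S2={s2_curve[-1]:.4f}  L={final:.4f}")
```

Under these assumptions a constant schedule yields $S_2 = 0$ at every step, so the formula reduces to the pure power-law term, while any annealing phase adds a positive $S_2$ and lowers the predicted loss.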

Paper Structure (63 sections, 12 equations, 29 figures, 3 tables)

Introduction
Preliminary
Scaling Laws
Learning Rate Annealing
Theory
Similarity between Learning Rate, Gradient Norm, and Loss
Scaling Laws for Constant LRS
Extra Loss Changes in LR Annealing
Training Discount in Annealing
LR Annealing Momentum
Final Formulation
Scaling Law with LR Annealing
Only One Extra Parameter
Loss Surface as a Slide.
Balance between $S_1$ and $S_2$.
...and 48 more sections

Figures (29)

Figure 1: Visualization of $S_1$ and $S_2$ at the 20th step of a cosine LR scheduler. $S_1$ is the forward area, i.e., the sum of the red grid areas, which can be approximately regarded as the total amount of movement of the neural network parameters; $S_2$ is the decayed annealing area, i.e., the weighted sum of the blue grid areas, where lighter shades indicate smaller weights. Both $S_1$ and $S_2$ contribute to loss reduction, and balancing their values is crucial for achieving the lowest possible final loss.
Figure 2: Using the scaling law with LR annealing to fit the full loss curves yielded by constant and cosine LRS. Total steps = 20K, $\eta_{max}=2\times10^{-4}$, $\eta_{min}=0$. The fitted equation is $L(s) = 2.628 + 0.429\cdot S_1^{-0.550} - 0.411\cdot S_2$.
Figure 3: Using the fitted equation from Fig. 2 to predict full loss curves for unseen LRS with 60K total steps. The left, middle, and right columns present the LR curve, the loss curve, and a zoomed-in view of the loss curve, respectively. Warmup steps (500) are not shown in this figure. The fitted equation accurately predicts each loss curve, and in particular captures the trend of loss changes as the LR varies. Notably, all LRS and loss curves shown here were unseen during the fitting in Fig. 2. The mean prediction error across different LRS is as low as $\sim 0.2\%$.
Figure 4: The shapes of the LR (top), gradient norm (middle), and validation loss (bottom) curves exhibit high similarity across various LRS (shown in different colors).
Figure 5: The delay phenomenon between the LR and validation loss curves. This phenomenon suggests that LR annealing (re-warmup) has momentum.
...and 24 more figures
