Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training

Atli Kosson, Bettina Messmer, Martin Jaggi

TL;DR

This work argues that warmup benefits training by keeping the overall size of the update $\Delta \mathbf{w}_t$ limited, counteracting large initial values of $\mathbf{u}_t$. It also shows that the need for warmup can be significantly reduced or eliminated by modifying the optimizer to explicitly normalize $\mathbf{u}_t$ based on update-size metrics: the $\ell_2$-norm, the resulting directional change, and the impact on the network's representations.

Abstract

Learning Rate Warmup is a popular heuristic for training neural networks, especially at larger batch sizes, despite limited understanding of its benefits. Warmup decreases the update size $\Delta \mathbf{w}_t = \eta_t \mathbf{u}_t$ early in training by using lower values for the learning rate $\eta_t$. In this work we argue that warmup benefits training by keeping the overall size of $\Delta \mathbf{w}_t$ limited, counteracting large initial values of $\mathbf{u}_t$. Focusing on small-scale GPT training with AdamW/Lion, we explore the following question: Why and by which criteria are early updates $\mathbf{u}_t$ too large? We analyze different metrics for the update size including the $\ell_2$-norm, resulting directional change, and impact on the representations of the network, providing a new perspective on warmup. In particular, we find that warmup helps counteract large angular updates as well as a limited critical batch size early in training. Finally, we show that the need for warmup can be significantly reduced or eliminated by modifying the optimizer to explicitly normalize $\mathbf{u}_t$ based on the aforementioned metrics.
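
To make two of the update-size metrics mentioned above concrete, the following is a minimal sketch of how the $\ell_2$-norm of $\Delta \mathbf{w}_t$ and the angular change of a weight vector might be measured for a single step. The function names and the sign convention $\mathbf{w}_{t+1} = \mathbf{w}_t + \eta_t \mathbf{u}_t$ are assumptions made for illustration; this is not the paper's implementation, and the paper's precise definitions may differ.

```python
import math


def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def update_metrics(w, u, lr):
    """Return (l2 norm of delta_w, angle in radians between w and w + delta_w).

    Assumes delta_w = lr * u and w_new = w + delta_w (hypothetical sign
    convention: any minus sign is taken to be absorbed into u).
    """
    delta_w = [lr * x for x in u]
    w_new = [a + b for a, b in zip(w, delta_w)]
    dot = sum(a * b for a, b in zip(w, w_new))
    cos = dot / (l2_norm(w) * l2_norm(w_new))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos = max(-1.0, min(1.0, cos))
    return l2_norm(delta_w), math.acos(cos)


# A smaller learning rate shrinks both the norm and the angle of the same step.
w = [1.0, 0.0]
u = [0.0, 1.0]
norm_full, angle_full = update_metrics(w, u, lr=1.0)
norm_warm, angle_warm = update_metrics(w, u, lr=0.1)
```

In this toy setting, lowering `lr` reduces both quantities at once, which is the sense in which warmup limits the size of early updates under either metric.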

Paper Structure

This paper contains 29 sections, 40 equations, 12 figures, 3 algorithms.

Figures (12)

  • Figure 1: Warmup significantly benefits GPT2 training with AdamW. Panel 1: Trapezoidal learning rate schedules with different warmup lengths and 50% linear cooldown. Panel 2: Final validation loss for various learning rate and warmup configurations. Note the performance gap between no-warmup (black) and other configurations. Panel 3: Training curves comparing the best no-warmup run to a 5% warmup with the same learning rate. The warmup run quickly surpasses the no-warmup run. Panel 4: Comparison of $\ell_2$ update norms for these runs shows large initial updates without warmup.
  • Figure 2: LionA (Algorithm \ref{alg:liona}) fails to significantly reduce the warmup advantage. Panel 1: Final validation loss across various learning rates and warmup percentages shows a reduced but still significant no-warmup penalty compared to AdamW (Figure \ref{fig:baseline_lr_wps}). Panel 2: Training curves for 0% vs. 5% warmup at the highest stable learning rate for 0%, with warmup quickly overtaking no-warmup as before. Panel 3: LionA successfully controls the $\ell_2$-update norm. Panel 4: Early angular updates (see Section \ref{sec:angular}) are large without warmup and do not follow the learning rate schedule throughout training.
  • Figure 3: LionAR (Algorithm \ref{alg:lionar}) reduces but does not fully eliminate the benefit of warmup. Panel 1: LionAR is more stable across learning rates and shows a reduced but still significant performance gap without warmup. Panel 2: Comparing 0% and 5% warmup at learning rate $\approx\!10^{-2}$ shows the warmup run overtaking early in training. Panel 3: LionAR precisely controls the angular update size throughout training. Panel 4: Despite fixed angular (and thus relative) updates in weight space, the relative change of the internal representations (see Section \ref{sec:snr_rrc}) is large initially without warmup.
  • Figure 4: Panel 1: Equation \ref{eq:rrc_snr} predicts that the learning rate needs to be downscaled for higher signal-to-noise ratios ($\varphi$) to keep the relative representation change constant. Larger batch sizes are affected more, with scaling becoming significant when $\varphi > B^{-1}$. Panel 2: Measurements of the SNR for the two highlighted runs in Figure \ref{fig:lionar_lr_wps}. Note that the SNR starts very high and also remains large in comparison to our $B=480$ for almost all of training. Panel 3: The gradient is strongly oppositely aligned with the momentum vector for most of training (shown for an example layer). Panel 4: Projecting the momentum component of the updates onto the gradient component shows that this results in the momentum vector "cancelling" roughly half the gradient on average.
  • Figure 5: Panel 1: LionAR with a correction factor for the RRC based on Equation \ref{eq:rrc_snr} does not benefit from warmup. Panel 2: LionAR training without momentum results in drastically lower performance. Panel 3: In LionAR with increased momentum $\beta=0.98$, Nesterov momentum, and an inverse bias correction for early momentum, no warmup performs best. Panel 4: The same does not apply to LionA, suggesting that these changes are not sufficient without controlling the angular updates.
  • ...and 7 more figures
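
The trapezoidal schedule named in the Figure 1 caption (linear warmup, constant plateau, 50% linear cooldown) could be sketched roughly as follows. The function name, the step-indexing convention, and the choice to warm up from and cool down toward zero are assumptions for illustration only; the caption does not specify these details.

```python
def trapezoidal_lr(step, total_steps, peak_lr, warmup_frac=0.05, cooldown_frac=0.5):
    """Hypothetical trapezoidal schedule: linear warmup, plateau, linear cooldown.

    step is 0-indexed and expected to lie in [0, total_steps).
    """
    warmup_steps = int(warmup_frac * total_steps)
    cooldown_steps = int(cooldown_frac * total_steps)
    cooldown_start = total_steps - cooldown_steps
    if step < warmup_steps:
        # Ramp linearly so that the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step >= cooldown_start:
        # Decay linearly toward zero over the final cooldown_steps steps.
        return peak_lr * (total_steps - step) / cooldown_steps
    return peak_lr


# Compare a 5% warmup schedule with a no-warmup schedule over 100 steps.
with_warmup = [trapezoidal_lr(s, 100, 1.0, warmup_frac=0.05) for s in range(100)]
no_warmup = [trapezoidal_lr(s, 100, 1.0, warmup_frac=0.0) for s in range(100)]
```

With `warmup_frac=0.0` the schedule starts directly at `peak_lr`, which corresponds to the no-warmup configurations that the figures compare against.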