Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training

Atli Kosson; Bettina Messmer; Martin Jaggi

TL;DR

This work argues that warmup benefits training by limiting the overall size of the update $\Delta \mathbf{w}_t$, counteracting large initial values of $\mathbf{u}_t$. It also shows that the need for warmup can be significantly reduced or eliminated by modifying the optimizer to explicitly normalize $\mathbf{u}_t$ based on update-size metrics: the $\ell_2$-norm, the resulting directional change, and the impact on the network's representations.

Abstract

Learning rate warmup is a popular heuristic for training neural networks, especially at larger batch sizes, despite limited understanding of its benefits. Warmup decreases the update size $\Delta \mathbf{w}_t = \eta_t \mathbf{u}_t$ early in training by using lower values for the learning rate $\eta_t$. In this work we argue that warmup benefits training by keeping the overall size of $\Delta \mathbf{w}_t$ limited, counteracting large initial values of $\mathbf{u}_t$. Focusing on small-scale GPT training with AdamW/Lion, we explore the following question: Why and by which criteria are early updates $\mathbf{u}_t$ too large? We analyze different metrics for the update size including the $\ell_2$-norm, resulting directional change, and impact on the representations of the network, providing a new perspective on warmup. In particular, we find that warmup helps counteract large angular updates as well as a limited critical batch size early in training. Finally, we show that the need for warmup can be significantly reduced or eliminated by modifying the optimizer to explicitly normalize $\mathbf{u}_t$ based on the aforementioned metrics.
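To make the quantities in the abstract concrete, the following is a minimal sketch of the update $\Delta \mathbf{w}_t = \eta_t \mathbf{u}_t$ together with two of the size metrics it mentions, the $\ell_2$-norm and the angular (directional) change. It is an illustration only, not the authors' implementation. The linear warmup schedule, the function names, and the fixed-norm rescaling in `normalize_update` are all assumptions introduced here; the paper's actual optimizer modifications may differ.

```python
import math

def l2_norm(v):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in v))

def warmup_lr(step, peak_lr, warmup_steps):
    """Linear warmup (an assumed schedule): eta_t ramps up to peak_lr."""
    return peak_lr * min(1.0, (step + 1) / warmup_steps)

def apply_lr(u, lr):
    """Delta w_t = eta_t * u_t, with any sign already folded into u_t."""
    return [lr * x for x in u]

def angular_change(w, delta_w):
    """Angle in radians between w and w + delta_w."""
    w_new = [a + b for a, b in zip(w, delta_w)]
    dot = sum(a * b for a, b in zip(w, w_new))
    cos = dot / (l2_norm(w) * l2_norm(w_new))
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding

def normalize_update(u, target_norm):
    """Hypothetical helper: rescale u_t to a fixed l2-norm."""
    n = l2_norm(u)
    return list(u) if n == 0.0 else [x * target_norm / n for x in u]

# Toy example: the same u_t rotates w far less under a warmed-up learning rate.
w = [1.0, 0.0]
u = [0.0, 1.0]
early = angular_change(w, apply_lr(u, warmup_lr(0, 1.0, 10)))  # eta_0 = 0.1
late = angular_change(w, apply_lr(u, warmup_lr(9, 1.0, 10)))   # eta_9 = 1.0
```

In this toy setting a lower $\eta_t$ shrinks both the norm and the angle of the step, which is the effect the abstract attributes to warmup. Rescaling $\mathbf{u}_t$ directly, as `normalize_update` does for the $\ell_2$-norm, is one simple way to bound the update size without changing the schedule.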
