Why Do We Need Warm-up? A Theoretical Perspective

Foivos Alimisis, Rustem Islamov, Aurelien Lucchi

TL;DR

This work tackles why learning-rate warm-up helps optimization in deep learning by introducing the $(H_0,H_1)$-smoothness framework, which bounds Hessian curvature by $H_0 + H_1(f(w)-f^*)$. It establishes that common neural architectures with MSE and CE losses satisfy this condition under mild weight-regularity (balancedness or L2 regularization) and uses it to prove that GD with an adaptive warm-up step-size converges faster than GD with a fixed step-size, including explicit upper and lower bounds. The authors further extend the analysis to stochastic settings and validate the theory with experiments on transformer language models and vision models, showing that the warm-up schedules can match or exceed linear warm-up in practice. The work also identifies limitations and outlines future directions, such as layer-wise extensions and curvature bounds valid across the entire training trajectory, to enhance practical applicability and theoretical sharpness.
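For reference, the curvature bound described above can be written next to the standard $(L_0, L_1)$-smoothness condition. This is a sketch only: the use of the operator norm and the requirement that the bound hold for all $w$ are assumptions made here, not the paper's formal definition.

$$
\|\nabla^2 f(w)\| \le L_0 + L_1 \|\nabla f(w)\| \qquad \text{($(L_0, L_1)$-smoothness)}
$$

$$
\|\nabla^2 f(w)\| \le H_0 + H_1 \, (f(w) - f^*) \qquad \text{($(H_0, H_1)$-smoothness)}
$$

Read this way, the second bound allows large curvature only where the loss is far from optimal. A step-size of order $1 / (H_0 + H_1 (f(w) - f^*))$ is therefore small early in training and grows as the loss decreases, which is the qualitative shape of a warm-up schedule.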

Abstract

Learning rate warm-up - increasing the learning rate at the beginning of training - has become a ubiquitous heuristic in modern deep learning, yet its theoretical foundations remain poorly understood. In this work, we provide a principled explanation for why warm-up improves training. We rely on a generalization of the $(L_0, L_1)$-smoothness condition, which bounds local curvature as a linear function of the loss sub-optimality and exhibits desirable closure properties. We demonstrate both theoretically and empirically that this condition holds for common neural architectures trained with mean-squared error and cross-entropy losses. Under this assumption, we prove that Gradient Descent with a warm-up schedule achieves faster convergence than with a fixed step-size, establishing upper and lower complexity bounds. Finally, we validate our theoretical insights through experiments on language and vision models, confirming the practical benefits of warm-up schedules.
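The comparison between a warm-up schedule and a fixed step-size can be illustrated on a toy problem. The sketch below is not the paper's algorithm: the step-size rule $\eta_t = 1 / (H_0 + H_1 (f(w_t) - f^*))$ is an assumed, natural choice suggested by the curvature bound, and the one-dimensional function $f(w) = \cosh(w) - 1$ is chosen only because it satisfies $f''(w) = 1 + (f(w) - f^*)$ exactly, so that $H_0 = H_1 = 1$. All names (`gd`, `adaptive`, `fixed`) are hypothetical.

```python
import math

# Toy objective: f(w) = cosh(w) - 1, with minimum value f* = 0 at w = 0.
# Its second derivative is cosh(w) = 1 + (f(w) - f*), so H0 = H1 = 1 here.
H0, H1 = 1.0, 1.0
F_STAR = 0.0
W_INIT = 5.0


def f(w):
    return math.cosh(w) - 1.0


def grad(w):
    return math.sinh(w)


def gd(w0, steps, step_size):
    """Run gradient descent; return the loss trajectory and the step-sizes used."""
    w = w0
    losses, etas = [], []
    for _ in range(steps):
        eta = step_size(w)
        losses.append(f(w))
        etas.append(eta)
        w -= eta * grad(w)
    losses.append(f(w))
    return losses, etas


def adaptive(w):
    # Assumed warm-up rule: inverse of the local curvature bound.
    return 1.0 / (H0 + H1 * (f(w) - F_STAR))


def fixed(w):
    # Fixed step-size sized for the worst-case curvature at the starting point.
    return 1.0 / (H0 + H1 * (f(W_INIT) - F_STAR))


adaptive_losses, adaptive_etas = gd(W_INIT, 10, adaptive)
fixed_losses, fixed_etas = gd(W_INIT, 10, fixed)
```

On this toy problem the adaptive step-size starts small and grows as the loss falls, while the fixed step-size stays at the conservative value needed for stability at initialization, so the adaptive run reaches a lower loss in the same number of steps.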

Paper Structure

This paper contains 53 sections, 29 theorems, 303 equations, 13 figures, 2 tables.

Key Result

Proposition 3.0

Consider a deep linear network with $\ell$ layers and MSE loss: ... where $Y \in \mathbb{R}^{c \times m}$ are the labels, $X \in \mathbb{R}^{d \times m}$ ($d \leq m$) is the input, and $W_i \in \mathbb{R}^{n_{i-1} \times n_i}$, with $n_0 = c$ and $n_{\ell} = d$, are the network's weights. In the space of strongly balanced weights, i.e., when $W_i^\top W_i = W_{i+1} W_{i+1} \dots$ where exact forms of $H_0$ and $H_1$ ...

Figures (13)

  • Figure 1: Local smoothness approximation versus training loss for language models of varying sizes on the FineWeb dataset, using SGD at a constant LR of $10^{-4}$. Each dot represents estimated local smoothness and stochastic training loss, with color indicating training progress, while the black dashed line shows the best linear fit. For much of early training, the relation is well-approximated by a line, aside from the very initial phase where smoothness behaves differently. This deviation likely arises because the linear fit reflects only an upper bound, suggesting that a more complex functional dependence may be necessary.
  • Figure 2: Local smoothness approximation versus training loss while training a ResNet50 (left) and ViT-Tiny (right) on ImageNet32, using SGD with a constant LR of $10^{-4}$.
  • Figure 3: Performance of Adam (for 70M and 160M) and AdamW (for 410M with weight decay $\lambda=0.1$) when training language models with three warm-up strategies: $(H_0, H_1)$ warm-up with tuned $C$, tuned linear warm-up, and no warm-up. The last $20 \%$ of iterations is a linear decay from the peak LR to $10^{-5}$ in all cases.
  • Figure 4: Effective LR with $(H_0, H_1)$ warm-up when training language models on the FineWeb dataset with a peak LR of $10^{-3}$, varying the parameter in $(H_0, H_1)$ warm-up.
  • Figure 5: Performance of AdamW with weight decay $\lambda=0.05$ when training ViT model on the ImageNet32 dataset with three warm-up strategies: $(H_0, H_1)$ warm-up with tuned $C$, tuned linear warm-up, and no warm-up. All LR schedules follow cosine decay after the warm-up phase.
  • ...and 8 more figures
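The linear warm-up baseline in the Figure 3 caption, followed by a linear decay to $10^{-5}$ over the last 20% of iterations, can be sketched as a simple schedule function. The function name `lr_schedule`, its parameters, and the choice to start the warm-up at `peak_lr / warmup_steps` are assumptions for illustration; the caption does not state the initial LR or the warm-up length.

```python
def lr_schedule(step, total_steps, peak_lr, warmup_steps,
                final_lr=1e-5, decay_frac=0.2):
    """Linear warm-up, constant plateau, then linear decay to final_lr.

    step is 0-indexed and must satisfy 0 <= step < total_steps.
    """
    decay_steps = round(decay_frac * total_steps)
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Plateau at the peak LR.
        return peak_lr
    # Linear interpolation from peak_lr down to final_lr over the decay phase.
    progress = (step - decay_start) / max(1, total_steps - 1 - decay_start)
    return peak_lr + progress * (final_lr - peak_lr)


# Hypothetical settings: 1000 iterations, peak LR 1e-3, 100 warm-up steps.
schedule = [lr_schedule(t, 1000, 1e-3, 100) for t in range(1000)]
```

With these hypothetical settings the LR rises over the first 100 steps, stays at the peak until step 800, and then decays linearly to $10^{-5}$ at the final step.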

Theorems & Definitions (56)

  • Definition 3.1
  • Proposition 3.0
  • Proposition 3.0
  • Proposition 3.0
  • Proposition 3.0
  • Remark 3.1
  • Definition 4.1: liu2023aiming
  • Definition 4.2: polyak1963gradient
  • Theorem 4.1
  • Theorem 4.2
  • ...and 46 more