Why Do We Need Warm-up? A Theoretical Perspective

Foivos Alimisis; Rustem Islamov; Aurelien Lucchi

TL;DR

This work explains why learning-rate warm-up helps optimization in deep learning by introducing the $(H_0,H_1)$-smoothness framework, which bounds the Hessian curvature by $H_0 + H_1(f(w)-f^*)$, so the curvature bound tightens as the loss approaches its optimum $f^*$. The paper shows that common neural architectures trained with MSE and CE losses satisfy this condition under mild weight-regularity assumptions (balancedness or L2 regularization), and uses it to prove that GD with an adaptive warm-up step-size converges faster than GD with a fixed step-size, with explicit upper and lower bounds. The authors extend the analysis to stochastic settings and validate the theory with experiments on transformer language models and vision models, showing that the resulting warm-up schedules can match or outperform linear warm-up in practice. The work also identifies limitations and outlines future directions, such as layer-wise extensions and curvature bounds that hold along the entire training trajectory.

Abstract

Learning rate warm-up - increasing the learning rate at the beginning of training - has become a ubiquitous heuristic in modern deep learning, yet its theoretical foundations remain poorly understood. In this work, we provide a principled explanation for why warm-up improves training. We rely on a generalization of the $(L_0, L_1)$-smoothness condition, which bounds local curvature as a linear function of the loss sub-optimality and exhibits desirable closure properties. We demonstrate both theoretically and empirically that this condition holds for common neural architectures trained with mean-squared error and cross-entropy losses. Under this assumption, we prove that Gradient Descent with a warm-up schedule achieves faster convergence than with a fixed step-size, establishing upper and lower complexity bounds. Finally, we validate our theoretical insights through experiments on language and vision models, confirming the practical benefits of warm-up schedules.
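To make the mechanism concrete, the sketch below runs Gradient Descent on a one-dimensional toy objective, $f(w) = \cosh(w)$, which is not taken from the paper. It is chosen because $f''(w) = \cosh(w) = 1 + (f(w) - f^*)$ with $f^* = 1$, so the curvature bound holds with $H_0 = H_1 = 1$. The adaptive step-size $\eta_t = 1 / (H_0 + H_1 (f(w_t) - f^*))$ is an illustrative assumption: it is simply the inverse of the curvature bound, and the paper's exact schedule and constants are not reproduced here. The fixed baseline uses the same rule evaluated once at the starting point.

```python
import math

# Toy objective (an assumption for illustration, not from the paper):
# f(w) = cosh(w), with minimum f* = 1 at w = 0.
# Its second derivative is cosh(w) = 1 + (f(w) - f*), i.e. H0 = H1 = 1.
H0, H1 = 1.0, 1.0
F_STAR = 1.0


def f(w):
    return math.cosh(w)


def grad(w):
    return math.sinh(w)


def curvature_bound(w):
    """Upper bound H0 + H1 * (f(w) - f*) on the curvature at w."""
    return H0 + H1 * (f(w) - F_STAR)


def run_gd(w0, steps, adaptive):
    """Run GD; return the final iterate and the step-sizes that were used."""
    w = w0
    fixed_eta = 1.0 / curvature_bound(w0)  # safe for the worst curvature, at w0
    etas = []
    for _ in range(steps):
        # Hypothetical adaptive rule: inverse of the current curvature bound.
        eta = 1.0 / curvature_bound(w) if adaptive else fixed_eta
        w -= eta * grad(w)
        etas.append(eta)
    return w, etas


w_adapt, etas_adapt = run_gd(5.0, 20, adaptive=True)
w_fixed, etas_fixed = run_gd(5.0, 20, adaptive=False)

print("adaptive: gap =", f(w_adapt) - F_STAR, "last step-size =", etas_adapt[-1])
print("fixed:    gap =", f(w_fixed) - F_STAR, "step-size =", etas_fixed[0])
```

In this toy run both variants start from the same small step-size, but the adaptive one grows toward $1/H_0$ as the loss gap shrinks, which is the warm-up shape. The fixed step-size stays tuned to the high initial curvature and makes much slower progress near the optimum.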
