Theoretical Analysis on how Learning Rate Warmup Accelerates Convergence

Yuxing Liu, Yuze Ge, Rui Pan, An Kang, Tong Zhang

TL;DR

This work explains why learning-rate warmup accelerates optimization in deep networks by introducing a novel $(\rho,K_0,K_\rho)$-smoothness condition that ties local curvature to the suboptimality gap. It proves that warmup can yield a $\Theta(T)$ speedup for GD and a $\Theta(\sqrt{T})$ speedup for SGD under suitable conditions, and it extends these results to ABC-noise models. Experiments on river-valley constructions and neural networks (ResNet on CIFAR-10 and NanoGPT on TinyShakespeare) support the theory and show practical gains. Overall, the paper provides a unified optimization-theoretic explanation for the benefits of warmup and outlines potential extensions to momentum-based optimizers and broader noise settings.

Abstract

Learning rate warmup is a popular and practical technique in training large-scale deep neural networks. Despite its success in practice, the theoretical advantages of gradually increasing the learning rate at the beginning of training are not fully understood. To close this gap between theory and practice, we first propose a novel family of generalized smoothness assumptions and validate its applicability both theoretically and empirically. Under this smoothness assumption, we study the convergence properties of gradient descent (GD) in both deterministic and stochastic settings. We show that learning rate warmup consistently accelerates GD, and that in some specific cases GD with warmup can converge up to $\Theta(T)$ times faster than with a non-increasing learning rate schedule, providing insight into the benefits of this strategy from an optimization theory perspective.
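The intuition in the abstract can be illustrated with a minimal sketch. It assumes (this form is not stated in the text above) that the proposed smoothness condition bounds the local curvature by $K_0 + K_\rho (f(x)-f^*)^\rho$. The toy objective $f(x)=x^4$, the constants, and the step-size rule $\eta_t = 1/(K_0 + K_\rho (f(x_t)-f^*)^\rho)$ are illustrative choices, not the schedule derived in the paper.

```python
# Toy objective f(x) = x^4 with f* = 0. Its curvature is f''(x) = 12 x^2 = 12 * f(x)^(1/2),
# so it satisfies the ASSUMED bound |f''(x)| <= K0 + K_RHO * f(x)^RHO with the constants below.
K0, K_RHO, RHO = 1.0, 12.0, 0.5  # hypothetical constants chosen for this example


def loss(x):
    return x ** 4


def grad(x):
    return 4 * x ** 3


def curvature_bound(x):
    """Assumed local smoothness bound as a function of the suboptimality gap."""
    return K0 + K_RHO * loss(x) ** RHO


def run_gd(x0, steps, adaptive):
    """Run GD; return the final loss and the list of learning rates used.

    adaptive=True : eta_t = 1 / curvature_bound(x_t), which grows as the loss
                    shrinks (a warmup-like, increasing schedule).
    adaptive=False: constant eta = 1 / curvature_bound(x0), the largest constant
                    step justified by the curvature at the initial point.
    """
    x = x0
    eta0 = 1.0 / curvature_bound(x0)
    lrs = []
    for _ in range(steps):
        eta = 1.0 / curvature_bound(x) if adaptive else eta0
        lrs.append(eta)
        x -= eta * grad(x)
    return loss(x), lrs


warm_loss, warm_lrs = run_gd(x0=2.0, steps=100, adaptive=True)
const_loss, const_lrs = run_gd(x0=2.0, steps=100, adaptive=False)
print(f"increasing schedule: final loss = {warm_loss:.3e}")
print(f"constant schedule:   final loss = {const_loss:.3e}")
```

In this toy setting the curvature-aware step size starts at the same value as the constant one and then increases as the loss falls, so it reaches a lower loss in the same number of steps. That is the qualitative behavior the abstract attributes to warmup.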

Paper Structure

This paper contains 40 sections, 23 theorems, 190 equations, 3 figures, 2 tables.

Key Result

Lemma 1

If a function $f:\mathbb{R}^d \to \mathbb{R}$ is $(\rho, L_0, L_\rho)$-smooth with $0\leq \rho<2$, then it is $(\alpha, K_0, K_\alpha)$-smooth with $\alpha=\frac{\rho}{2-\rho}$.
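As a quick check of the exponent map in Lemma 1, the substitutions below evaluate $\alpha=\frac{\rho}{2-\rho}$ at a few values of $\rho$. This is plain arithmetic on the stated formula; the constants $K_0$ and $K_\alpha$ are not given in the text above and are not reconstructed here.

```latex
\alpha = \frac{\rho}{2-\rho}: \qquad
\rho = 0 \;\Rightarrow\; \alpha = 0, \qquad
\rho = \tfrac{1}{2} \;\Rightarrow\; \alpha = \tfrac{1}{3}, \qquad
\rho = 1 \;\Rightarrow\; \alpha = 1, \qquad
\rho = \tfrac{3}{2} \;\Rightarrow\; \alpha = 3, \qquad
\rho \to 2^{-} \;\Rightarrow\; \alpha \to \infty .
```

The map is increasing on $[0,2)$ and diverges as $\rho$ approaches 2, which is consistent with the lemma's restriction to $0\leq\rho<2$.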

Figures (3)

  • Figure 1: Local smoothness vs. suboptimality gap when training (a) ResNet18 on CIFAR-10 and (b) NanoGPT on the TinyShakespeare character dataset. Both the $x$ and $y$ axes are in log scale, and the color bar indicates the iteration number. The plots use $f^* = 0$.
  • Figure 2: An empirical experiment based on the synthetic problem setting of the river-valley example. The loss convergence curves are on the left, and the learning rate dynamics are on the right.
  • Figure 3: A comparison of warmup learning rate schedules in ResNet training. The blue line is the theoretical warmup schedule derived in the paper's GD theorem, and the yellow line is the standard linear warmup. The blue line is smoothed in the plot for clarity.

Theorems & Definitions (49)

  • Definition 1
  • Lemma 1
  • Example 1
  • Example 2 (Example 1 of patel2022global)
  • Example 3 (Example 2 of patel2022global)
  • Lemma 2
  • Theorem 1
  • Theorem 2
  • Example 4
  • Theorem 3
  • ...and 39 more