On the adequacy of untuned warmup for adaptive optimization

Jerry Ma, Denis Yarats

TL;DR

The paper challenges the variance-based justification for Rectified Adam (RAdam) and shows that untuned warmup for Adam performs comparably in typical settings. By shifting focus to the magnitude of the update steps, it argues that warmup is primarily about stabilizing early updates, not rectifying variance. It introduces two simple, tuning-free warmup schedules (exponential and linear) with practical guidance, and empirically validates their parity with RAdam across image classification, language modeling, and machine translation. The work recommends linear warmup over $2 / (1 - \beta_2)$ iterations as a robust default and suggests exploring dynamic warmup strategies in the future. Overall, it questions the necessity of RAdam and presents straightforward warmup as an effective, low-complexity alternative.

Abstract

Adaptive optimization algorithms such as Adam are widely used in deep learning. The stability of such algorithms is often improved with a warmup schedule for the learning rate. Motivated by the difficulty of choosing and tuning warmup schedules, recent work proposes automatic variance rectification of Adam's adaptive learning rate, claiming that this rectified approach ("RAdam") surpasses the vanilla Adam algorithm and reduces the need for expensive tuning of Adam with warmup. In this work, we refute this analysis and provide an alternative explanation for the necessity of warmup based on the magnitude of the update term, which is of greater relevance to training stability. We then provide some "rule-of-thumb" warmup schedules, and we demonstrate that simple untuned warmup of Adam performs more-or-less identically to RAdam in typical practical settings. We conclude by suggesting that practitioners stick to linear warmup with Adam, with a sensible default being linear warmup over $2 / (1 - \beta_2)$ training iterations.
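The recommended default can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation: the function name `linear_warmup_factor`, the 1-indexed step convention, and the choice to scale a base learning rate by a multiplicative factor are all assumptions made for the example. Only the warmup length of $2 / (1 - \beta_2)$ iterations comes from the abstract.

```python
def linear_warmup_factor(step: int, beta2: float = 0.999) -> float:
    """Multiplier in (0, 1] applied to the base learning rate.

    `step` is assumed to be the 1-indexed training iteration (an
    assumption of this sketch). The factor grows linearly and reaches
    1.0 after about 2 / (1 - beta2) iterations, the default warmup
    length suggested in the abstract.
    """
    warmup_steps = 2.0 / (1.0 - beta2)  # about 2000 for beta2 = 0.999
    return min(1.0, step / warmup_steps)


if __name__ == "__main__":
    base_lr = 1e-3  # hypothetical base learning rate for illustration
    for step in (1, 500, 1000, 2000, 5000):
        lr = base_lr * linear_warmup_factor(step)
        print(f"step {step:>5}: lr = {lr:.3e}")
```

With $\beta_2 = 0.999$ the warmup spans roughly 2000 iterations, so the schedule needs no hyperparameter beyond the $\beta_2$ already chosen for Adam. In a framework with a multiplicative learning-rate scheduler, a function of this shape could be passed as the per-step scaling rule, with the step offset adjusted to the framework's counting convention.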

Paper Structure

This paper contains 31 sections, 22 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Analysis of gradients and updates during the training of a simple feed-forward network on the EMNIST digit recognition task with the Adam optimizer (see the EMNIST training-details appendix for comprehensive details).
  • Figure 2: Distribution of Adam's update step magnitudes at a simulated local minimum of ${\mathcal{L}}(\theta)$ (quantiles: $\left\{ 2.5\%, 25\%, 50\%, 75\%, 97.5\% \right\}$).
  • Figure 3: Comparison of various characteristics of RAdam and rule-of-thumb warmup schedules.
  • Figure 4: Mean training loss (5 seeds) of ResNet-50 on ImageNet, using Adam with $\alpha = 10^{-3}$ and $\beta_2 = 0.999$.
  • Figure 5: Mean validation perplexity (3 seeds) of Transformer LM on WikiText-103, using Adam with $\alpha = 10^{-4}$ and $\beta_2 = 0.999$.
  • ...and 3 more figures

Theorems & Definitions (2)

  • proof
  • proof