Why Warmup the Learning Rate? Underlying Mechanisms and Improvements

Dayal Singh Kalra; Maissam Barkeshli

TL;DR

We show how the initial learning rate $\eta_{\text{init}}$ can be chosen using the loss catapult mechanism, which reduces the number of warmup steps and in some cases eliminates the need for warmup entirely.

Abstract

It is common in deep learning to warm up the learning rate $\eta$, often by a linear schedule between $\eta_{\text{init}} = 0$ and a predetermined target $\eta_{\text{trgt}}$. In this paper, we show through systematic experiments using SGD and Adam that the overwhelming benefit of warmup arises from allowing the network to tolerate larger $\eta_{\text{trgt}}$ by forcing the network into better-conditioned areas of the loss landscape. The ability to handle larger $\eta_{\text{trgt}}$ makes hyperparameter tuning more robust while improving the final performance. We uncover different regimes of operation during the warmup period, depending on whether training starts off in a progressive sharpening or sharpness reduction phase, which in turn depends on the initialization and parameterization. Using these insights, we show how $\eta_{\text{init}}$ can be properly chosen by utilizing the loss catapult mechanism, which saves on the number of warmup steps, in some cases completely eliminating the need for warmup. We also suggest an initialization for the variance in Adam which provides benefits similar to warmup.
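As an illustration of the linear schedule described in the abstract, the following is a minimal sketch of a warmup that ramps the learning rate from $\eta_{\text{init}}$ to $\eta_{\text{trgt}}$. The function name `linear_warmup`, the parameter `warmup_steps`, and the numeric values are assumptions made for this example and are not taken from the paper.

```python
def linear_warmup(step, eta_init, eta_trgt, warmup_steps):
    """Learning rate at `step` for a linear ramp from eta_init to eta_trgt.

    Hypothetical helper for illustration; names are not from the paper.
    """
    # With no warmup, or once warmup is over, use the target learning rate.
    if warmup_steps <= 0 or step >= warmup_steps:
        return eta_trgt
    # Fraction of the warmup period completed so far, in [0, 1).
    frac = step / warmup_steps
    return eta_init + (eta_trgt - eta_init) * frac


# Illustrative values only: ramp from 0 to 0.1 over 4 steps, then hold.
schedule = [linear_warmup(t, 0.0, 0.1, 4) for t in range(6)]
```

Setting `eta_init` above zero shortens the ramp the optimizer has to climb, which is the knob the abstract refers to when it discusses choosing $\eta_{\text{init}}$; how to pick that value is the subject of the paper and is not reproduced here.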

Paper Structure

This paper contains 74 sections, 14 equations, 67 figures, 1 table, and 3 algorithms.

Introduction
Our contributions.
Notations and Preliminaries
Overview of Training Instabilities and the Self-Stabilization Mechanism
Warmup Mechanisms of Gradient and Adaptive Methods
Stochastic Gradient Descent
Stochastic Gradient Descent with Momentum (SGD-M)
Adaptive Gradient Methods (Adam)
Impact of Warmup on Training and Generalization
Stochastic Gradient Descent (SGD)
Adaptive Gradient Methods (Adam)
Improved Hyperparameter Initialization Schemes for Optimizers
Discussion
Practical Guidance for Practitioners
How to Select the Warmup Duration?
...and 59 more sections

Figures (67)

Figure 7: Comparison of persistent catapult warmup (in black) with linear warmup with different durations. The experimental setup is the same as in \ref{fig:mechanisms_fcns_mse_sgd_B5000_cifar10}, but the model is trained on the entire CIFAR-10 dataset using SGD with a batch size $B=512$.
Figure 8: Training loss and sharpness trajectories of FCNs trained on CIFAR-10 with MSE loss using SGD with a batch size $B=512$. The dashed lines in the sharpness figures illustrate the instability thresholds $2/\eta_t$. (top) $\mu$P with learning rate $1/\lambda_0^H$, (bottom) SP with learning rate $32/\lambda_0^H$.
Figure 9: Training loss and sharpness trajectories of FCNs trained on a 5k subset of CIFAR-10 using MSE loss and full-batch GD with momentum $\beta = 0.9$: (top) $\mu$P with learning rate $1/\lambda_0^H$, (middle) SP with learning rate $1/\lambda_0^H$, and (bottom) SP with learning rate $32/\lambda_0^H$. The dotted lines in the sharpness figures correspond to the $(2 + 2 \beta)/\eta_t$ curves, while the dashed lines show $2/\eta_t$ for reference.
Figure 10: Training loss and sharpness trajectories of FCNs trained on CIFAR-10 with MSE loss using SGD with a batch size $B=512$ and momentum $\beta = 0.9$: (top) $\mu$P with learning rate $1/\lambda_0^H$, and (bottom) SP with learning rate $32/\lambda_0^H$. The dotted lines in the sharpness figures correspond to the $(2 + 2 \beta)/\eta_t$ curves, while the dashed lines show $2/\eta_t$ for reference. Similar mechanisms are observed for cross-entropy loss with a decrease in sharpness at late training times, as detailed in \ref{appendix:warmup_mechanisms_cross_entropy}.
Figure 11: Training loss and sharpness trajectories of FCNs trained on CIFAR-10 with cross-entropy loss using SGD with a batch size $B=512$: (top) $\mu$P with learning rate $1/\lambda_0^H$, (bottom) SP with learning rate $32/\lambda_0^H$.
...and 62 more figures
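
The captions above refer to the threshold curves $2/\eta_t$ and, with momentum $\beta$, $(2 + 2\beta)/\eta_t$, against which the sharpness is plotted. The sketch below simply evaluates those two expressions and compares a sharpness value against them. The function names, the choice to treat the caption's learning rate as the current $\eta_t$, and the value of `lambda_0` are assumptions for illustration, not the paper's code or data.

```python
def instability_threshold(eta_t, beta=0.0):
    """Threshold curve from the captions: (2 + 2*beta) / eta_t.

    With beta = 0 this reduces to 2 / eta_t.
    """
    if eta_t <= 0:
        raise ValueError("eta_t must be positive")
    return (2.0 + 2.0 * beta) / eta_t


def exceeds_threshold(sharpness, eta_t, beta=0.0):
    """True if the given sharpness lies above the threshold curve."""
    return sharpness > instability_threshold(eta_t, beta)


# Hypothetical initial sharpness; not a value reported in the paper.
lambda_0 = 50.0

# Learning rate 1/lambda_0: the threshold is 2*lambda_0, above the sharpness.
small_lr = exceeds_threshold(lambda_0, 1.0 / lambda_0)

# Learning rate 32/lambda_0: the threshold is lambda_0/16, below the sharpness.
large_lr = exceeds_threshold(lambda_0, 32.0 / lambda_0)

# Same learning rate with momentum beta = 0.9 raises the threshold by 1.9x.
large_lr_momentum = exceeds_threshold(lambda_0, 32.0 / lambda_0, beta=0.9)
```

Under these assumptions, a learning rate of $32/\lambda_0^H$ places the initial sharpness a factor of 16 above the $2/\eta_t$ curve, whereas $1/\lambda_0^H$ leaves it a factor of 2 below.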
