What do near-optimal learning rate schedules look like?

Hiroki Naganuma; Atish Agarwala; Priya Kasimbeg; George E. Dahl

TL;DR

We design a search procedure to find the best shapes within a parameterized schedule family. We find that warmup and decay are robust features of good schedules, and that commonly used schedule families are not optimal on the workloads studied.

Abstract

A basic unanswered question in neural network training is: what is the best learning rate schedule shape for a given workload? The choice of learning rate schedule is a key factor in the success or failure of the training process, but beyond having some kind of warmup and decay, there is no consensus on what makes a good schedule shape. To answer this question, we designed a search procedure to find the best shapes within a parameterized schedule family. Our approach factors out the schedule shape from the base learning rate, which otherwise would dominate cross-schedule comparisons. We applied our search procedure to a variety of schedule families on three workloads: linear regression, image classification on CIFAR-10, and small-scale language modeling on WikiText-103. We showed that our search procedure generally found near-optimal schedules. We found that warmup and decay are robust features of good schedules, and that commonly used schedule families are not optimal on these workloads. Finally, we explored how the outputs of our shape search depend on other optimization hyperparameters, and found that weight decay can have a strong effect on the optimal schedule shape. To the best of our knowledge, our results represent the most comprehensive results to date on near-optimal schedule shapes for deep neural network training.
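
The idea of factoring the schedule shape out of the base learning rate can be illustrated with a minimal sketch. It assumes one possible convention, which the abstract does not specify: the shape is a dimensionless function of training progress with peak value 1, and the absolute learning rate is the base learning rate times that shape. The function names and the warmup-plus-cosine shape are illustrative choices, not the paper's implementation.

```python
import math


def warmup_cosine_shape(frac, warmup_frac=0.1):
    """Illustrative unit-peak shape on [0, 1]: linear warmup, then cosine decay to 0.

    `frac` is the fraction of training completed. The peak value of 1 is an
    assumed normalization, not one taken from the paper.
    """
    if frac < warmup_frac:
        return frac / warmup_frac
    progress = (frac - warmup_frac) / (1.0 - warmup_frac)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def learning_rate(step, total_steps, base_lr, shape=warmup_cosine_shape):
    """Absolute learning rate = base learning rate * dimensionless shape."""
    return base_lr * shape(step / total_steps)


# The same shape evaluated at two base learning rates differs only by a scale
# factor, so shape and scale can be tuned and compared separately.
small = [learning_rate(t, 1000, base_lr=1e-3) for t in range(1001)]
large = [learning_rate(t, 1000, base_lr=3e-3) for t in range(1001)]
```

Under this convention, comparing two shapes fairly means tuning `base_lr` for each shape before comparing their best results, rather than comparing them at one shared base learning rate.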

Paper Structure (40 sections, 9 equations, 21 figures, 10 tables)

Introduction
Related Work
Methods
Learning Rate Schedule Families
Workloads and Experimental Setup
Workloads
Optimization-limited Regime and Training Setup
Search Procedure
Search step.
Evaluation step.
Results
Linear regression: test case with ground truth
Near-optimal schedules for CIFAR-10 and WikiText-103 workloads
Base learning rate is the most important factor for a good schedule.
Warmup and monotonic decay are both crucial.
...and 25 more sections

Figures (21)

Figure 1: Learning rate schedule families used in our experiments: Constant, Cosine, Generalized Cosine, Generalized Rex, Smooth Non-Monotonic, Square-root Decay, Two-Point Spline, and Two-Point Linear. Markers identify key points such as initial learning rate, warmup completion, intermediate control points, and end of training. Numerical annotations specify parameters particular to each schedule. Unlike in the other families, the peak of a Smooth Non-Monotonic schedule can fall at any position relative to the control points.
Figure 2: For a linear regression workload, schedules found via random search capture some of the features of the theoretically optimal schedule but fail to match it completely (left). Average losses appear to be better than theoretically optimal; however, when re-evaluating with $1000$ seeds, searched schedules match the theoretical prediction and are slightly worse than optimal (right).
Figure 3: Training metrics versus base learning rate for the best schedules in each family for linear regression (left), CIFAR-10 (middle), and WikiText-103 (right). Each row corresponds to a schedule family, each column to a learning rate, and lighter colors correspond to better performance. Base learning rate is far more important for success than schedule identity, with the exception of the Constant schedule, which performs worse in all cases.
Figure 4: The best Smooth Non-Monotonic family member found does not match the optimal schedule (left, blue). We can numerically solve for the best-fit curve in the family (left, orange). This obtains very similar performance to the theoretically optimal schedule (right).
Figure 5: Near-optimal learning rate schedules for CIFAR-10 (left) and WikiText-103 (right). The curves represent the best absolute learning rate schedules for each family found in our search procedure, which minimized final train error (CIFAR-10) or train perplexity (WikiText-103). All curves show similar warmup and decay patterns in each workload, including the Smooth Non-Monotonic family, which is not guaranteed to have those properties. More flexible families perform better than Constant and Cosine.
...and 16 more figures
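
The effect described in the Figure 2 caption, where searched schedules look better than the theoretical optimum until they are re-evaluated with $1000$ seeds, is what one would expect whenever the best of many noisily scored candidates is selected. The sketch below is a toy illustration of that selection effect, not the paper's procedure: the noise model, the 50 candidates, the 3 search seeds, and all function names are assumptions chosen for the example.

```python
import random
import statistics


def noisy_final_loss(true_loss, rng, noise_std=0.1):
    """Hypothetical stand-in for one training run: true loss plus seed noise."""
    return true_loss + rng.gauss(0.0, noise_std)


def mean_loss(true_loss, num_seeds, rng):
    """Average the noisy loss over `num_seeds` independent runs."""
    return statistics.fmean([noisy_final_loss(true_loss, rng) for _ in range(num_seeds)])


rng = random.Random(0)

# Toy setup (assumed): 50 candidate schedules that all share the same true loss.
true_losses = [1.0] * 50

# Search step: score each candidate with only a few seeds and keep the minimum.
search_scores = [mean_loss(loss, 3, rng) for loss in true_losses]
best = min(range(len(true_losses)), key=search_scores.__getitem__)
search_estimate = search_scores[best]

# Evaluation step: re-score only the winner with many fresh seeds.
reeval_estimate = mean_loss(true_losses[best], 1000, rng)

print(f"search-time estimate: {search_estimate:.3f}")
print(f"re-evaluated estimate: {reeval_estimate:.3f}")
```

Because the winner is chosen for having the lowest noisy score, its search-time estimate is biased downward; the many-seed re-evaluation removes most of that bias, which is why a separate evaluation step is useful after a search step.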
