The Road Less Scheduled

Aaron Defazio, Xingyu Alice Yang, Harsh Mehta, Konstantin Mishchenko, Ahmed Khaled, Ashok Cutkosky

TL;DR

The paper addresses the dependence of training on predefined learning-rate schedules by introducing Schedule-Free SGD and Schedule-Free AdamW, methods that eliminate the need to specify a stopping time T yet match or surpass schedule-based performance on problems ranging from convex optimization to large-scale deep learning. It builds a unifying online-to-batch framework that interpolates between Polyak-Ruppert averaging and primal averaging via a momentum parameter, with accompanying theoretical convergence guarantees and practical stability. Experiments across 28 problems and benchmarks, including MLCommons AlgoPerf, show the methods matching or outperforming tuned schedules with no additional hyperparameters over standard optimizers with momentum, though some tasks require careful hyperparameter sweeps and batch normalization (BN) adjustments. The approach is offered as an open-source alternative to learning-rate scheduling for deep learning and large-scale convex problems.

Abstract

Existing learning rate schedules that do not require specification of the optimization stopping step T are greatly out-performed by learning rate schedules that depend on T. We propose an approach that avoids the need for this stopping time by eschewing the use of schedules entirely, while exhibiting state-of-the-art performance compared to schedules across a wide family of problems ranging from convex problems to large-scale deep learning problems. Our Schedule-Free approach introduces no additional hyper-parameters over standard optimizers with momentum. Our method is a direct consequence of a new theory we develop that unifies scheduling and iterate averaging. An open source implementation of our method is available at https://github.com/facebookresearch/schedule_free. Schedule-Free AdamW is the core algorithm behind our winning entry to the MLCommons 2024 AlgoPerf Algorithmic Efficiency Challenge Self-Tuning track.
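
To make the interpolation between Polyak-Ruppert averaging and primal averaging concrete, the following is a minimal sketch of a Schedule-Free SGD style recursion on a toy one-dimensional problem. The update rule is not written out in this summary, so the three-sequence form used here (a base sequence z, an averaged sequence x, and a gradient evaluation point y mixed by the momentum parameter β) is stated as an assumption based on the published method. The function names, the toy objective |x - 3|, and the values T = 400 and β = 0.9 are hypothetical choices for illustration. The step size follows the γ = D/(G√T) definition that appears in Theorem 1.

```python
import math


def schedule_free_sgd(grad, x1, gamma, beta, num_steps):
    """Sketch of a Schedule-Free SGD recursion for a scalar problem.

    Assumed update (not spelled out in the summary text):
        y_t     = (1 - beta) * z_t + beta * x_t
        z_{t+1} = z_t - gamma * grad(y_t)
        x_{t+1} = (1 - c) * x_t + c * z_{t+1},  with c = 1 / (t + 1)
    beta = 0 evaluates gradients at z (Polyak-Ruppert averaging);
    beta = 1 evaluates gradients at x (primal averaging).
    """
    z = x1  # base SGD sequence
    x = x1  # uniformly averaged sequence, the one that is returned
    for t in range(1, num_steps):
        y = (1.0 - beta) * z + beta * x  # gradient evaluation point
        z = z - gamma * grad(y)          # plain SGD step, constant step size
        c = 1.0 / (t + 1)
        x = (1.0 - c) * x + c * z        # running average of z_1..z_{t+1}
    return x


def subgrad_abs(y, target=3.0):
    """A subgradient of the toy objective |y - target| (1-Lipschitz)."""
    return (y > target) - (y < target)


# Toy setup (hypothetical values): start at 0, minimizer at 3.
T = 400
G = 1.0                          # Lipschitz constant of |x - 3|
D = abs(0.0 - 3.0)               # distance from the start to the minimizer
gamma = D / (G * math.sqrt(T))   # step size as defined in Theorem 1

x_T = schedule_free_sgd(subgrad_abs, x1=0.0, gamma=gamma, beta=0.9, num_steps=T)
gap = abs(x_T - 3.0)             # suboptimality of the averaged iterate
print(f"gamma={gamma:.3f}  x_T={x_T:.4f}  gap={gap:.4f}")
```

No learning-rate schedule appears anywhere in the loop: the step size stays constant and only the averaging weight c shrinks over time. The official implementation linked in the abstract is the reference for the actual optimizers; this sketch only illustrates the structure of the averaging.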

Paper Structure

This paper contains 38 sections, 11 theorems, 78 equations, 12 figures, 1 algorithm.

Key Result

Theorem 1

Suppose $F$ is a convex function, and $\zeta_1,\dots,\zeta_T$ is an i.i.d. sequence of random variables such that $F=\mathbb{E}[f(x,\zeta)]$ for some function $f$ that is $G$-Lipschitz in $x$. For any minimizer $x_\star$, define $D=\left\Vert x_{1}-x_\star\right\Vert$ and $\gamma=D/(G\sqrt{T})$. The

Figures (12)

  • Figure 1: Schedule-Free methods (black) closely track the Pareto frontier of loss vs. training time in a single run. Both Schedule-Free SGD (left) and AdamW (right) match or exceed the performance of cosine learning rate schedules of varying lengths (red).
  • Figure 2: Schedule-Free learning converges faster than classical averaging approaches, often out-performing tuned schedules. Existing averaging approaches such as Polyak and Primal averaging significantly under-perform schedules.
  • Figure 3: Incorporating the momentum parameter $\beta$ allows for convergence despite using larger learning rates $\gamma$ on quadratic problems. Dark region indicates convergence.
  • Figure 4: Illustration of the contribution of the gradient at each time step to the gradient location sequence $y$ and the returned evaluation sequence $x$. The horizontal axis is the time-step, and the vertical axis is the fraction of the gradient from each time-step incorporated into the iterate sequence.
  • Figure 5: Deep Learning Experiments
  • ...and 7 more figures

Theorems & Definitions (19)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 3
  • proof
  • Theorem 4
  • proof
  • Theorem 5
  • proof
  • Corollary 1
  • ...and 9 more