Schedulers for Schedule-free: Theoretically inspired hyperparameters

Yuen-Man Pun, Matthew Buchholz, Robert M. Gower

TL;DR

The paper addresses hyperparameter tuning challenges in deep learning by extending the schedule-free SGD theory from constant learning rates to arbitrary schedules, and by introducing averaging rules that align with the learning-rate schedule. It presents schedulet, a theoretically motivated averaging scheme for non-constant schedules, and shows that the warmup-stable-decay (wsd) schedule yields the optimal $O(DG/\sqrt{T})$ convergence under convex-Lipschitz assumptions. It then proposes schedulep, a Polyak-based adaptive stepsize with an interpolation assumption, achieving an anytime last-iterate convergence bound of $O(GD/\sqrt{t})$ and demonstrating competitiveness in black-box distillation tasks. Empirically, the theory-guided approaches predict training dynamics on ResNet-20/CIFAR-10 and prove robust in distillation experiments across TinyShakespeare and fineweb1B, offering a practically impactful framework for non-constant schedulers in schedule-free optimization. Collectively, the work provides a theoretically grounded path to effective, low-tuning optimization across vision and language tasks while clarifying the relationship between convex theory and non-convex deep learning dynamics.

Abstract

The recently proposed schedule-free method has been shown to achieve strong performance when hyperparameter tuning is limited. The current theory for schedule-free only supports a constant learning rate, whereas the implementation used in practice uses a warm-up schedule. We show how to extend the last-iterate convergence theory of schedule-free to allow for any scheduler, and how the averaging parameter has to be updated as a function of the learning rate. We then perform experiments showing that our convergence theory has some predictive power with regard to practical executions on deep neural networks, even though the theory relies on assuming convexity. When applied to the warmup-stable-decay (wsd) schedule, our theory shows the optimal convergence rate of $\mathcal{O}(1/\sqrt{T})$. We then use convexity to design a new adaptive Polyak learning rate schedule for schedule-free. We prove an optimal anytime last-iterate convergence for our new Polyak schedule, and show that it performs well compared to a number of baselines on a black-box model distillation task.
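
To make the idea of coupling the averaging parameter to the learning rate concrete, here is a minimal sketch in plain Python. It assumes the standard schedule-free SGD recursion (an interpolated point `y`, a base iterate `z`, and an averaged iterate `x`), since the paper's own update equations are not reproduced on this page. The averaging rule $c_t = \eta_t/\sum_{i \le t}\eta_i$ is the one quoted in the Figure 4 caption; its exact indexing here, the helper names, the interpolation weight `beta`, the warmup and cooldown fractions, and the toy objective $f(x) = \|x - x^\star\|_1$ are all illustrative assumptions.

```python
def wsd_schedule(T, base_lr, warmup_frac=0.1, cooldown_frac=0.2):
    """Warmup-stable-decay: linear warmup, constant plateau, linear cooldown.

    The fractions are hypothetical defaults, not values from the paper.
    """
    warmup = max(1, int(warmup_frac * T))
    cooldown = max(1, int(cooldown_frac * T))
    etas = []
    for t in range(T):
        if t < warmup:
            etas.append(base_lr * (t + 1) / warmup)      # ramp up to base_lr
        elif t >= T - cooldown:
            etas.append(base_lr * (T - t) / cooldown)    # ramp down, stays > 0
        else:
            etas.append(base_lr)                         # stable phase
    return etas


def subgrad_l1(y, target):
    """A subgradient of the convex, Lipschitz toy loss f(y) = sum_i |y_i - target_i|."""
    return [float((v > t) - (v < t)) for v, t in zip(y, target)]


def schedule_free_sgd(grad, x0, etas, beta=0.9):
    """Schedule-free SGD with a learning-rate-weighted average (sketch).

    Assumed recursion:
        y_t     = (1 - beta) * z_t + beta * x_t
        z_{t+1} = z_t - eta_t * grad(y_t)
        x_{t+1} = (1 - c) * x_t + c * z_{t+1},  with c = eta_t / sum_{i<=t} eta_i
    """
    x, z = list(x0), list(x0)
    eta_sum = 0.0
    for eta in etas:
        y = [(1 - beta) * zi + beta * xi for zi, xi in zip(z, x)]
        g = grad(y)
        z = [zi - eta * gi for zi, gi in zip(z, g)]
        eta_sum += eta
        c = eta / eta_sum  # equals 1 on the first step, then shrinks
        x = [(1 - c) * xi + c * zi for xi, zi in zip(x, z)]
    return x


# Usage: minimize |x - 3| starting from 0 under a wsd schedule.
etas = wsd_schedule(T=2000, base_lr=0.05)
x_final = schedule_free_sgd(lambda y: subgrad_l1(y, [3.0]), [0.0], etas)
print("final averaged iterate:", x_final)
```

With this rule, `x` is the $\eta$-weighted average of the `z` iterates, so steps taken during warmup or cooldown contribute less to the returned point than steps taken at the full learning rate.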

Paper Structure

This paper contains 37 sections, 18 theorems, 133 equations, 10 figures, 2 tables, 3 algorithms.

Key Result

theorem 0

Let $f\colon\mathbb{R}^d \to\mathbb{R}$ be convex and $G$-Lipschitz continuous. Let $\{\bm{x}_t,\bm{y}_t,\bm{z}_t\}$ be generated from eq:y-sfsgd, eq:z-sfsgd, eq:x-sfsgd. Suppose that for $t=1,\ldots,T$. Initializing $\bm{z}_{-1} = \bm{x}_0$, we then have

Figures (10)

  • Figure 1: Our theory (Theorem \ref{thm:conv-rate}) is good at predicting the behavior of the training loss: The plots show the theoretical bound and the training loss of ResNet-20/CIFAR-10 when using wsd schedules with base learning rate $\gamma = 10$ and three different cooldown lengths. The gradient norm over the iterations is shown in the rightmost plot for reference. Red denotes the warmup period, gray the constant period, and blue the cooldown period.
  • Figure 2: Training loss for schedule-free on ResNet-20/CIFAR-10 with a constant learning rate schedule (gray), a warmup-stable schedule (red-gray), and a wsd schedule (red-gray-blue).
  • Figure 3: Using wsd schedules with three different cooldown periods and base learning rate $\gamma = 0.01$, our plots compare the theoretical convergence (Theorem \ref{thm:conv-rate}) to the empirical convergence of ResNet-20/CIFAR-10, with the gradient norm shown for reference. Red denotes the warmup period, gray the constant period, and blue the cooldown period.
  • Figure 4: The averaging parameter $c_t$ under the wsd schedule: blue is our proposed $c_t = \eta_t/\sum_{i=0}^t\eta_i$, gray is $c_t = 1/t$, and orange is the practical heuristic $c_t = \eta_t^2/\sum_{i=0}^t\eta_i^2$.
  • Figure 5: Using schedules with three different diverging periods, we compare the theoretical convergence given by Theorem \ref{thm:conv-rate} to the empirical convergence of ResNet-20/CIFAR-10. Gray denotes the constant period and red denotes the diverging period.
  • ...and 5 more figures
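
The three averaging rules named in the Figure 4 caption can be compared numerically. The sketch below is only an illustration: the ten-step schedule is a hand-made wsd-shaped list rather than one from the paper, the function name is hypothetical, and the uniform rule is written as $1/(t+1)$ because the index starts at 0 here.

```python
def averaging_parameters(etas):
    """Return the three c_t sequences from the Figure 4 caption for a schedule."""
    proposed, uniform, squared = [], [], []
    eta_sum, eta_sq_sum = 0.0, 0.0
    for t, eta in enumerate(etas):
        eta_sum += eta
        eta_sq_sum += eta ** 2
        proposed.append(eta / eta_sum)        # c_t = eta_t / sum_i eta_i
        uniform.append(1.0 / (t + 1))         # c_t = 1/t, shifted to 0-based t
        squared.append(eta ** 2 / eta_sq_sum) # c_t = eta_t^2 / sum_i eta_i^2
    return {"proposed": proposed, "uniform": uniform, "squared": squared}


# Toy wsd-shaped schedule (illustrative values): warmup, plateau, cooldown.
etas = [0.25, 0.5, 0.75, 1.0, 1.0, 1.0, 1.0, 0.75, 0.5, 0.25]
c = averaging_parameters(etas)
for name, values in c.items():
    print(name, [round(v, 3) for v in values])
```

On a constant schedule all three rules coincide. They differ only where the learning rate changes: during the cooldown both learning-rate-aware rules fall below the uniform $1/(t+1)$, and the squared heuristic falls fastest.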

Theorems & Definitions (31)

  • theorem 0
  • lemma 0
  • corollary 0
  • theorem 1
  • lemma 2
  • proof
  • lemma 3
  • proof
  • theorem 3
  • proof
  • ...and 21 more