Schedulers for Schedule-free: Theoretically inspired hyperparameters
Yuen-Man Pun, Matthew Buchholz, Robert M. Gower
TL;DR
The paper tackles hyperparameter tuning in deep learning by extending the theory of schedule-free SGD from constant learning rates to arbitrary schedules, and by introducing averaging rules matched to the learning-rate schedule. It presents schedulet, a theoretically motivated averaging scheme for non-constant schedules, and shows that the warmup-stable-decay (wsd) schedule attains the optimal $O(DG/\sqrt{T})$ convergence rate under convex-Lipschitz assumptions. It then proposes schedulep, a Polyak-based adaptive stepsize that, under an interpolation assumption, achieves an anytime last-iterate convergence bound of $O(DG/\sqrt{t})$ and is competitive on black-box distillation tasks. Empirically, the theory-guided approaches predict training dynamics on ResNet-20/CIFAR-10 and prove robust in distillation experiments on TinyShakespeare and fineweb1B. Overall, the work offers a theoretically grounded, low-tuning approach to non-constant schedules in schedule-free optimization across vision and language tasks, while clarifying how convex theory relates to non-convex deep learning dynamics.
Abstract
The recently proposed schedule-free method has been shown to achieve strong performance when hyperparameter tuning is limited. The current theory for schedule-free only supports a constant learning rate, whereas the implementation used in practice uses a warmup schedule. We show how to extend the last-iterate convergence theory of schedule-free to allow for any scheduler, and how the averaging parameter has to be updated as a function of the learning rate. We then perform experiments showing that our convergence theory has some predictive power with regard to practical executions on deep neural networks, even though the theory assumes convexity. When applied to the warmup-stable-decay (wsd) schedule, our theory shows the optimal convergence rate of $\mathcal{O}(1/\sqrt{T})$. We then use convexity to design a new adaptive Polyak learning rate schedule for schedule-free. We prove an optimal anytime last-iterate convergence for our new Polyak schedule, and show that it performs well compared to a number of baselines on a black-box model distillation task.
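For readers unfamiliar with the warmup-stable-decay (wsd) schedule named above, the following is a minimal sketch of its general shape. It is an illustration under stated assumptions, not the paper's implementation: the linear warmup, the linear decay, the function name `wsd_schedule`, and all of its parameters are hypothetical choices, and the sketch does not include the averaging rule that the paper pairs with the schedule.

```python
def wsd_schedule(t, total_steps, peak_lr, warmup_steps, decay_steps):
    """Learning rate at step t (0-indexed) for a warmup-stable-decay shape.

    Assumed shape (not taken from the paper): linear warmup to peak_lr,
    a constant plateau, then linear decay over the final decay_steps steps.
    """
    if not 0 <= t < total_steps:
        raise ValueError("t must satisfy 0 <= t < total_steps")
    if warmup_steps + decay_steps > total_steps:
        raise ValueError("warmup and decay phases must fit in total_steps")

    decay_start = total_steps - decay_steps
    if t < warmup_steps:
        # Warmup: ramp linearly, reaching peak_lr at the last warmup step.
        return peak_lr * (t + 1) / warmup_steps
    if t < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay: ramp linearly down, staying strictly positive at the last step.
    return peak_lr * (total_steps - t) / decay_steps


# Example: 100 steps, 10 warmup steps, 20 decay steps.
lrs = [wsd_schedule(t, 100, 1.0, 10, 20) for t in range(100)]
```

In this sketch the learning rate never reaches exactly zero, which keeps every step well defined if the value is later used as a weight or a divisor.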
