Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training
Minhak Song, Beomhan Baek, Kwangjun Ahn, Chulhee Yun
TL;DR
Traditional decaying learning-rate schedules, which assume a fixed compute budget, are a poor fit for large-language-model pretraining. We evaluate Schedule-Free (SF) methods, especially SF-AdamW, and show that they navigate the river-valley loss landscape without explicit LR decay or explicit weight averaging: SF effectively performs weight averaging implicitly and operates near the Edge of Stability. A reformulation shows that SF's updates act as momentum along a river-aligned path, and a refined variant decouples momentum from averaging via a decoupling parameter $C$, improving robustness to the momentum parameter and behavior at large batch sizes. Language-modeling experiments with a 124M-parameter model show competitive perplexity and stable training with minimal memory overhead, supporting SF as a practical, scalable optimizer for language-model pretraining.
Abstract
As both model and dataset sizes continue to scale rapidly, conventional pretraining strategies with fixed compute budgets, such as cosine learning rate schedules, are increasingly inadequate for large-scale training. Recent alternatives, including warmup-stable-decay (WSD) schedules and weight averaging, offer greater flexibility. However, WSD relies on explicit decay phases to track progress, while weight averaging addresses this limitation at the cost of additional memory. In search of a more principled and scalable alternative, we revisit the Schedule-Free (SF) method [Defazio et al., 2024], which has shown strong empirical performance across diverse settings. We show that SF-AdamW effectively navigates the "river" structure of the loss landscape without decay phases or auxiliary averaging, making it particularly suitable for continuously scaling training workloads. To understand this behavior, we conduct a theoretical and empirical analysis of SF dynamics, revealing that it implicitly performs weight averaging without memory overhead. Guided by this analysis, we propose a refined variant of SF that improves robustness to momentum and performs better under large batch sizes, addressing key limitations of the original method. Together, these results establish SF as a practical, scalable, and theoretically grounded approach for language model training.
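
For orientation, the base Schedule-Free update of Defazio et al. [2024] can be sketched for plain SGD. This is only a minimal illustration under stated assumptions: it is not SF-AdamW and not the refined variant with the decoupling parameter $C$. The function name `schedule_free_sgd`, the toy quadratic objective, and the hyperparameter values are hypothetical choices made for the example. The sketch keeps a base iterate `z`, a uniformly averaged iterate `x`, and evaluates gradients at an interpolation `y` of the two, with no learning-rate schedule.

```python
import numpy as np

def schedule_free_sgd(grad, x0, lr=0.1, beta=0.9, steps=200):
    """Sketch of Schedule-Free SGD (after Defazio et al., 2024), uniform averaging."""
    z = np.array(x0, dtype=float)  # base SGD iterate
    x = z.copy()                   # running average of the z iterates
    for t in range(1, steps + 1):
        y = (1.0 - beta) * z + beta * x  # gradient is evaluated at this interpolation
        z = z - lr * grad(y)             # constant step size, no decay phase
        c = 1.0 / (t + 1)                # uniform averaging weight
        x = (1.0 - c) * x + c * z        # online average, no stored checkpoints
    return x, z

# Toy problem (an assumption for illustration): f(w) = 0.5 * sum(a * w**2).
a = np.array([1.0, 0.1])
loss = lambda w: 0.5 * float(np.sum(a * w ** 2))
grad = lambda w: a * w

x0 = np.array([1.0, 1.0])
x_avg, z_last = schedule_free_sgd(grad, x0)
print("initial loss:", loss(x0), "loss at averaged iterate:", loss(x_avg))
```

The averaged iterate `x` is updated in place from `z`, so the only state beyond the base iterate is one extra copy of the parameters, which `y` can be recomputed from at each step.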
