Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training

Minhak Song, Beomhan Baek, Kwangjun Ahn, Chulhee Yun

TL;DR

Pretraining schedules that assume a fixed compute budget, such as decaying learning rates, are a poor fit for continually scaling large-language-model training. We evaluate Schedule-Free (SF) methods, especially SF-AdamW, and show that they follow the river of the river-valley loss landscape without explicit LR decay or auxiliary weight averaging: SF implicitly performs weight averaging and operates near the Edge of Stability. A reformulation shows that SF's updates act as momentum along a river-aligned path. A refined variant decouples momentum from averaging through a decoupling parameter $C$, which improves robustness to the momentum setting and performance at large batch sizes. Language-model experiments with a 124M-parameter model show competitive perplexity and stable training with minimal memory overhead, supporting SF as a practical, scalable optimizer for language-model pretraining.

Abstract

As both model and dataset sizes continue to scale rapidly, conventional pretraining strategies with fixed compute budgets, such as cosine learning rate schedules, are increasingly inadequate for large-scale training. Recent alternatives, including warmup-stable-decay (WSD) schedules and weight averaging, offer greater flexibility. However, WSD relies on explicit decay phases to track progress, while weight averaging addresses this limitation at the cost of additional memory. In search of a more principled and scalable alternative, we revisit the Schedule-Free (SF) method [Defazio et al., 2024], which has shown strong empirical performance across diverse settings. We show that SF-AdamW effectively navigates the "river" structure of the loss landscape without decay phases or auxiliary averaging, making it particularly suitable for continuously scaling training workloads. To understand this behavior, we conduct a theoretical and empirical analysis of SF dynamics, revealing that it implicitly performs weight averaging without memory overhead. Guided by this analysis, we propose a refined variant of SF that improves robustness to momentum and performs better under large batch sizes, addressing key limitations of the original method. Together, these results establish SF as a practical, scalable, and theoretically grounded approach for language model training.

Paper Structure

This paper contains 47 sections, 5 theorems, 50 equations, 15 figures, 6 tables, 2 algorithms.

Key Result

Proposition 4.0

Consider running SF-GD on a quadratic objective $f({\mathbf w}) = \tfrac{1}{2}{\mathbf w}^\top \mathbf H {\mathbf w} + {\mathbf g}^\top{\mathbf w} + c$. If $\lambda_1(\mathbf H) > \tfrac{2}{(1-\beta)\gamma}$, then the iterates $\{({\mathbf x}_t, {\mathbf y}_t, {\mathbf z}_t)\}$ diverge.
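The threshold can be checked numerically on a one-dimensional quadratic. The sketch below is illustrative only: this page does not state the SF-GD recursion, so the code assumes the standard Schedule-Free form from Defazio et al. (2024), namely ${\mathbf y}_t = (1-\beta){\mathbf z}_t + \beta{\mathbf x}_t$, ${\mathbf z}_{t+1} = {\mathbf z}_t - \gamma \nabla f({\mathbf y}_t)$, ${\mathbf x}_{t+1} = (1-c_{t+1}){\mathbf x}_t + c_{t+1}{\mathbf z}_{t+1}$ with uniform averaging weights $c_{t+1} = 1/(t+1)$. The function name `sf_gd_quadratic_1d` and the values of `h`, `gamma`, and `beta` are hypothetical choices made for the demonstration.

```python
def sf_gd_quadratic_1d(h, gamma, beta, steps, w0=1.0):
    """Run SF-GD on f(w) = 0.5 * h * w**2 and return the final (x, y, z).

    Assumed recursion (standard Schedule-Free form, not taken from this page):
        y_t     = (1 - beta) * z_t + beta * x_t
        z_{t+1} = z_t - gamma * grad f(y_t)
        x_{t+1} = (1 - c) * x_t + c * z_{t+1},  with c = 1 / (t + 1)
    """
    x = z = y = w0
    for t in range(1, steps + 1):
        y = (1.0 - beta) * z + beta * x   # point where the gradient is evaluated
        z = z - gamma * h * y             # base GD step; grad f(y) = h * y
        c = 1.0 / (t + 1)                 # assumed uniform averaging weight
        x = (1.0 - c) * x + c * z         # running average of the z iterates
    return x, y, z


gamma, beta = 0.1, 0.5
threshold = 2.0 / ((1.0 - beta) * gamma)  # stability threshold on the curvature

# One curvature below the threshold and one above it.
_, y_below, _ = sf_gd_quadratic_1d(h=30.0, gamma=gamma, beta=beta, steps=500)
_, y_above, _ = sf_gd_quadratic_1d(h=50.0, gamma=gamma, beta=beta, steps=500)

print(f"threshold           = {threshold:.1f}")
print(f"|y| below threshold = {abs(y_below):.3e}")
print(f"|y| above threshold = {abs(y_above):.3e}")
```

With these values the threshold is 40. The run with curvature 30 shrinks toward the minimizer, while the run with curvature 50 blows up, which is consistent with the proposition. Note that the proposition itself only asserts divergence above the threshold; the bounded behavior below it is an observation about this particular run.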

Figures (15)

  • Figure 1: River-valley structure in a toy loss landscape. Contour plot of the objective defined in \ref{sec:toy}, illustrating the flat river direction and steep hill direction characteristic of the river-valley geometry.
  • Figure 2: SF-AdamW closely follows the river, unlike AdamW. Left, Middle: While AdamW benefits from linear LR decay and EWA, SF-AdamW shows no improvement from either. Right: A short decay phase of AdamW (with linear LR decay from 1e-4 to 0) leads to a sharp loss drop for AdamW, but has no effect when applied to the SF-AdamW trajectory, suggesting that SF-AdamW already tracks the river throughout training (\ref{obs:sf_river}).
  • Figure 3: Linear interpolation between training checkpoints. We evaluate the loss along linear interpolations $\alpha {\mathbf w}_{t_1} + (1{-}\alpha){\mathbf w}_{t_2}$, where $\alpha \in [0, 1]$ and $t_1$, $t_2$ denote the 2B and 2.5B token checkpoints, respectively. We compare three training regimes: (1) AdamW with constant learning rate (LR), (2) AdamW with a linear LR decay to zero, and (3) SF-AdamW with constant LR. For all settings, the first 2B tokens are trained using either constant-LR AdamW (for 1 and 2) or constant-LR SF-AdamW (for 3). The resulting curves exhibit qualitatively distinct behaviors: convex (valley-shaped) for (1), sharp monotonic decay for (2), and flat, slowly declining loss for (3) (\ref{obs:sf_river}).
  • Figure 4: SF-AdamW with suboptimal momentum fails to follow the river. A short decay phase of AdamW applied to SF-AdamW checkpoints with $\beta_1 \in \{0.1, 0.5\}$ results in a sharp loss drop, unlike the case with $\beta_1 = 0.95$ (\ref{obs:sensitive-momentum}).
  • Figure 5: SF-AdamW on toy model. Left: The ${\mathbf x}_t$ iterates fail to follow the river for $\beta_1 \in \{0.1, 0.5\}$ (\ref{obs:sensitive-momentum}). Right: The ${\mathbf y}_t$ iterates oscillate around the river but track it reliably on average, even for suboptimal values of $\beta_1$ (\ref{obs:y}). As $\beta_1$ increases, the oscillations shrink.
  • ...and 10 more figures

Theorems & Definitions (8)

  • Proposition 4.0: Stability Threshold of SF-GD
  • Proposition 4.0: Stability Threshold of SF-PrecondGDW
  • Proposition D.0: Stability Threshold of SF-GD
  • proof
  • Lemma D.1
  • proof
  • Proposition D.1: Stability Threshold of SF-PrecondGDW
  • proof