Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective

Kaiyue Wen, Zhiyuan Li, Jason Wang, David Hall, Percy Liang, Tengyu Ma

TL;DR

The paper addresses the inflexibility of fixed-budget cosine learning-rate schedules by proposing Warmup-Stable-Decay (WSD), which maintains a constant learning rate along a main training branch and uses a rapid decay to generate intermediate checkpoints, enabling multi-budget outcomes from a single run. It introduces a river valley loss-landscape framework to explain why a high learning-rate phase accelerates progress along the river while a subsequent decay phase reduces oscillations and reveals true progress along the valley. Theoretical results show SGD and GD dynamics on river valleys track a dominant river direction, with stable phases enhancing river progress and decay phases reducing hill oscillations, complemented by toy data and GPT-2 visualizations. Building on this theory, the paper proposes WSD-S, a continual-learning variant that restarts from decayed checkpoints, reusing decay phases, and demonstrates competitive or superior performance across 0.1B–1.2B parameter language models compared to cosine-based and cyclic schedules. The findings suggest a practical, compute-efficient pathway for continual pretraining and checkpoint generation, with broad implications for scaling laws, data-heterogeneity, and loss-landscape-informed optimization strategies.

Abstract

Training language models currently requires pre-determining a fixed compute budget because the typical cosine learning rate schedule depends on the total number of steps. In contrast, the Warmup-Stable-Decay (WSD) schedule uses a constant learning rate to produce a main branch of iterates that can in principle continue indefinitely without a pre-specified compute budget. Then, given any compute budget, one can branch out from the main branch at a proper time with a rapidly decaying learning rate to produce a strong model. Empirically, WSD generates a non-traditional loss curve: the loss remains elevated during the stable phase but sharply declines during the decay phase. Towards explaining this phenomenon, we conjecture that pretraining loss exhibits a river valley landscape, which resembles a deep valley with a river at its bottom. Under this assumption, we show that during the stable phase, the iterate undergoes large oscillations due to the high learning rate, yet it progresses swiftly along the river. During the decay phase, the rapidly dropping learning rate minimizes the iterate's oscillations, moving it closer to the river and revealing true optimization progress. Therefore, the sustained high learning rate phase and fast decaying phase are responsible for progress in the river and the mountain directions respectively, and are both critical. Our analysis predicts phenomena consistent with empirical observations and shows that this landscape can emerge from pretraining on a simple bi-gram dataset. Inspired by the theory, we introduce WSD-S, a variant of WSD that reuses previous checkpoints' decay phases and keeps only one main branch, where we resume from a decayed checkpoint. WSD-S empirically outperforms WSD and Cyclic-Cosine in obtaining multiple language model checkpoints across various compute budgets in a single run for parameters scaling from 0.1B to 1.2B.
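
To make the budget dependence concrete, the following is a minimal sketch of the two schedule shapes described in the abstract. It is not the authors' implementation: the function names, the linear warmup, and the linear decay shape are assumptions chosen for illustration, and the paper may use a different decay curve.

```python
import math


def wsd_lr(step, peak_lr, warmup_steps, decay_start, decay_steps, min_lr=0.0):
    """Warmup-Stable-Decay: linear warmup, constant plateau, short decay.

    The linear decay used here is an assumption for illustration only.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: the value does not depend on when decay will start.
        return peak_lr
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


def cosine_lr(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Cosine schedule: every post-warmup value depends on total_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

In this sketch, changing `decay_start` leaves every earlier learning rate untouched, so the stable-phase iterates can be shared across budgets, whereas changing `total_steps` in `cosine_lr` alters the learning rate at nearly every step.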

Paper Structure (29 sections, 40 theorems, 191 equations, 13 figures, 3 tables)

Key Result

Theorem 3.2

If a loss $L$ is a river valley (Definition 3.1), then for the gradient flow $w(t)$ defined in eq:gfflow, the iterate will obey the following dynamics:

Figures (13)

  • Figure 1: The Distinctive Loss Curve produced by $\textit{WSD}$. A constant learning rate phase, characterized by slow loss improvements, eventually leads to better validation loss after learning rate decay.
  • Figure 2: We theoretically analyze the Warmup-Stable-Decay ($\textit{WSD}$) schedule and propose a river valley loss landscape model to explain its effectiveness (illustrated in the landscape panel). The stable phase adopts a large learning rate, and the iterate progresses along the river while oscillating between the sharp hillsides. Because of the large oscillations caused by the large learning rate, the run may show a higher loss in this phase than a run using a smaller learning rate. During the decay phase, the learning rate is dropped rapidly to damp the oscillations of the iterate, driving it closer to the river and revealing the optimization progress. Based on our theory, we propose WSD-Simplified ($\textit{WSD-S}$), an effective simplification of the $\textit{WSD}$ schedule for continual learning, in which we resume directly with a high learning rate from previous intermediate checkpoints. We visualize the learning rate schedules in the learning-rate panel. The arrow in the second row of that panel indicates that $\textit{WSD}$ instead resumes from the last checkpoint of the constant learning rate phase.
  • Figure 3: A River Valley Loss Landscape and the Optimization Dynamics with Various Learning Rates.
  • Figure 4: Illustration of the Definition of the River.
  • Figure 5: Illustration of Theory. We validate our theory using a 2D example. The blue curve represents the "river", where the gradient aligns with the minimal eigenvector of the Hessian. (Left) Randomly initialized gradient flows converge near the river and follow it closely thereafter. (Middle) Discrete step-size gradient descent shows similar behavior: after initial oscillations, the gradient descent iterates align closely with their projections on the river. (Right) Stochastic gradient descent (SGD) also tracks the river. In contrast to discrete-step gradient descent, the iterates oscillate around the river rather than staying on it. The trajectory with a larger learning rate exhibits faster progress and greater oscillations than the trajectory with a smaller learning rate.
  • ...and 8 more figures
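
The behavior described in the Figure 2 and Figure 5 captions can be reproduced qualitatively on a toy problem. The sketch below is not the paper's 2D example: it assumes a simple quadratic loss $L(x, y) = \tfrac{1}{2} \cdot 0.01\, x^2 + \tfrac{1}{2} \cdot 10\, y^2$, whose "river" is the line $y = 0$, and it assumes Gaussian gradient noise in the hill direction only. All names and constants are hypothetical.

```python
import random

# Assumed toy landscape: sharp curvature across the valley (hill direction y)
# and flat curvature along it (river direction x).
SHARP, FLAT = 10.0, 0.01


def run(schedule, steps=1000, noise=1.0, seed=0):
    """Run noisy gradient descent and return (river_loss, hill_loss) at the end."""
    rng = random.Random(seed)
    x, y = 10.0, 0.0
    for t in range(steps):
        lr = schedule(t)
        grad_x = FLAT * x
        # Assumption: gradient noise acts only in the hill direction.
        grad_y = SHARP * y + noise * rng.gauss(0.0, 1.0)
        x -= lr * grad_x
        y -= lr * grad_y
    return 0.5 * FLAT * x * x, 0.5 * SHARP * y * y


def mean_losses(schedule, seeds=200):
    """Average the two loss components over independent noise seeds."""
    runs = [run(schedule, seed=s) for s in range(seeds)]
    river = sum(r for r, _ in runs) / seeds
    hill = sum(h for _, h in runs) / seeds
    return river, hill


def high(t):
    return 0.1  # constant large learning rate (stable phase only)


def low(t):
    return 0.01  # constant small learning rate


def wsd(t):
    # Stable phase for 900 steps, then a linear decay over the last 100 steps.
    return 0.1 if t < 900 else 0.1 * (1000 - t) / 100


if __name__ == "__main__":
    for name, schedule in [("high", high), ("low", low), ("wsd", wsd)]:
        river, hill = mean_losses(schedule)
        print(f"{name:>4}: river={river:.4f} hill={hill:.4f} total={river + hill:.4f}")
```

Under these assumptions, the large constant learning rate makes the most progress along the river but carries a large hill component from oscillation, the small learning rate oscillates little but advances slowly, and the decay phase removes most of the hill component while keeping the river progress.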

Theorems & Definitions (79)

  • Definition 3.1: River Valley Landscape
  • Theorem 3.2
  • Theorem 3.3
  • Theorem 3.4
  • Theorem 3.5
  • Lemma 4.1
  • Lemma A.1
  • proof
  • Lemma A.2
  • proof
  • ...and 69 more