How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining

Kairong Luo, Zhenbo Sun, Haodong Wen, Xinyu Shi, Jiarui Cui, Chenyi Dang, Kaifeng Lyu, Wenguang Chen

TL;DR

This work reveals a fundamental tension between ascending data curricula and decaying learning rate schedules in large-language-model pretraining. By systematically analyzing the interaction, the authors show that curriculum gains collapse under standard LR decay, especially when high-quality data appears late in training. They propose two pragmatic remedies—moderate LR decay and model averaging—and unify them into Curriculum Model Averaging (CMA), which, together with data curricula, yields robust improvements (up to 1.64% average accuracy) at 1.5B parameter scale trained on 30B tokens. The paper also introduces Curriculum with LR Decay and Model Averaging (CDMA) as a further optimization, supported by a simple theoretical model and extensive ablations across metrics, datasets, and mid-training scenarios. Overall, the results advocate co-designing data curricula with training dynamics to enhance pretraining efficiency and performance.

Abstract

Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.
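
The first remedy, a more moderate LR decay, can be illustrated with a small schedule function. This is a minimal sketch, not the authors' implementation: the function name `wsd_lr`, the linear warmup and linear decay shapes, and the step counts are assumptions made for illustration. The peak LR of 3e-3 and the ending LRs are taken from the values listed in the Figure 2 caption.

```python
def wsd_lr(step, total_steps, peak_lr, end_lr, warmup_steps=0, decay_frac=0.18):
    """Warmup-Stable-Decay schedule (hypothetical sketch).

    Linear warmup to peak_lr, a constant plateau, then a linear decay that
    reaches end_lr on the last step. The decay shape is an assumption.
    """
    decay_steps = int(round(total_steps * decay_frac))
    decay_start = total_steps - decay_steps
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    # Fraction of the decay phase completed, clipped to 1.0 at the end.
    progress = min(1.0, (step - decay_start + 1) / decay_steps)
    return peak_lr + (end_lr - peak_lr) * progress


# Aggressive decay ends near 1e-5; moderate decay stops at 1e-3.
total = 1000
aggressive_final = wsd_lr(total - 1, total, peak_lr=3e-3, end_lr=1e-5)
moderate_final = wsd_lr(total - 1, total, peak_lr=3e-3, end_lr=1e-3)
```

With an ascending curriculum, the highest-quality samples arrive during the last steps, so the ending LR controls how large the updates on that data can be. Under these assumed settings the moderate schedule finishes with a step size 100 times larger than the aggressive one.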

Paper Structure

This paper contains 54 sections, 3 theorems, 28 equations, 8 figures, 9 tables, 1 algorithm.

Key Result

Theorem 6.1

Given a learning rate $\eta_0 \le 1$, let $\bar{\bm{w}}_M = \frac{1}{n}\sum_{t=0}^{n-1} \bm{w}_{M-t}$ be the average of the last $n$ weights, with $n = \Theta(M^{\frac{2}{3}})$. Then the expected loss satisfies a bound (its explicit form is not reproduced in this summary), where $\tilde{O}(\cdot)$ hides log factors and constants independent of $L$ and $M$.
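
The averaged iterate in the theorem can be computed directly. The sketch below is an illustration under stated assumptions: `tail_average` is a hypothetical helper, weights are plain Python lists of floats, and the constant hidden by $\Theta(\cdot)$ is taken to be 1, so that $n$ is simply $M^{2/3}$ rounded to the nearest integer.

```python
def tail_average(weights, n=None):
    """Uniform average of the last n weight vectors w_{M-n+1}, ..., w_M.

    weights: list [w_0, ..., w_M], each a list of floats of equal length.
    n: window size; defaults to round(M ** (2/3)), an assumed choice of the
       constant in n = Theta(M^(2/3)).
    """
    M = len(weights) - 1  # index of the final iterate
    if n is None:
        n = round(M ** (2 / 3))
    n = max(1, min(len(weights), n))  # keep the window within range
    tail = weights[-n:]
    dim = len(tail[0])
    return [sum(w[i] for w in tail) / n for i in range(dim)]


# Toy trajectory: w_t = [t] for t = 0..27, so M = 27 and n = 9.
trajectory = [[float(t)] for t in range(28)]
w_bar = tail_average(trajectory)
```

Here the window covers iterates 19 through 27, so `w_bar` is their mean. In a real training run each entry would be a full parameter vector, and the average would typically be taken over saved checkpoints rather than every step.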

Figures (8)

  • Figure 1: Data curriculum strategies are less effective when combined with learning rate (LR) schedules that decay to a low scale near the end. (a-c) Experiments on a 1.5B-parameter model trained on 30B tokens compare various data curricula (Uniform, Ascending-Order, and Descending-Order by DCLM score [DCLM]) under constant, Warmup-Stable-Decay (WSD) [Hu et al., 2024; Hägele et al.], and cosine schedules. While curricula improve validation loss over a uniform baseline with a constant LR, this advantage is significantly reduced during a low-LR phase following LR decay. (d) In the data curriculum, high-quality data is placed in the latter phase, which coincides with the LR decaying to a relatively low scale.
  • Figure 2: When varying the decay steps across 37%, 18%, 6% and 0% of training (Long, Mid, Short, Zero, respectively) and ending LRs ($1 \times 10^{-5}, 1 \times 10^{-3}, 2 \times 10^{-3}, 3 \times 10^{-3}$), the benefit of data curriculum diminishes with more aggressive LR decay. For each LR decay, we train 1.5B-parameter models with uniform and ascending ordering of data based on DCLM scores, and measure the difference in validation loss. As shown in (b) and (d), this difference becomes smaller with more decay steps or smaller ending LRs.
  • Figure 3: A stage-wise "data folding" curriculum mitigates the negative interaction observed between data ordering and learning rate (LR) decay (detailed in \ref{sec:folding}), but data folding cannot match end-to-end sorting under a constant learning rate. Left: We compare simple ascending curricula (Ascend), sorted by DCLM score, against their "folding" counterparts (Ascend+Folding). The folding method involves partitioning the data into stages (three in our implementation) and performing the sort within each stage. The Descend(+Folding) curriculum is designed in reverse order. Middle: Under a standard cosine LR schedule, folding strategies reduce validation loss compared to simple sorting but are outperformed by a uniform data baseline. Right: Conversely, with a constant LR schedule where decay does not weaken the utility of high-quality data, the advantage of folding vanishes, and a simple ascending-order curriculum becomes the most effective strategy.
  • Figure 4: Visualization of our intuition about the interplay between data ordering and LR schedules. We assume the gradient update can be decomposed as a signal direction and a noise direction. High-quality data can offer a less noisy direction and a more stable signal direction, while low-quality data can induce a more noisy update. Uniform+Decay, Ascend+Decay and Ascend+EMA represent different training strategies. Ascend+EMA can make the best use of the high-quality data in the curriculum. The right-hand figure shows the projection of the trajectories of the last 8 steps for these cases onto the $w_{noise}$-$\mathcal{L}(w)$ plane.
  • Figure 5: This figure compares various training strategies, identifying a high-performing and previously underexplored Optimal Regime where moderate learning rate (LR) decay, weight averaging, and curriculum learning produce synergistic advantages. We run experiments on both Uniform (uniformly ordered data) and Ascend (training data arranged by ascending DCLM scores) data schedules. For both schedules, we conduct an ablation on the ending learning rates of WSD schedules, ranging from $1\times 10^{-5}$ to $1\times 10^{-3}$, representing aggressive to moderate LR decay. We denote strategies applying weight averaging as EMA, which compute the final model checkpoint via an EMA of the last six checkpoints, and denote those strategies without weight averaging as WSD. We measure performance by the average downstream task score (as in \ref{tab:main_results}). This newly identified regime contrasts with the Previous Focus Regime, which represents common practices without a data curriculum or weight averaging, and with an ending LR between $1 \times 10^{-5}$ and $1 \times 10^{-4}$. This range is typical in prior work, which often uses an ending LR of one-tenth of a peak LR (on the scale of $\times10^{-4}$) [Grattafiori et al., 2024; DeepSeek-AI, 2025] or fixes the ending LR around $10^{-5}$ [DCLM; Li et al., 2025]. This observation also holds for mid-training settings.
  • ...and 3 more figures
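
Two mechanisms named in the captions above, the stage-wise "data folding" curriculum (Figure 3) and the EMA over the last six checkpoints (Figure 5), can be sketched in a few lines. This is a hedged illustration, not the paper's code: the function names are hypothetical, the captions do not say how samples are assigned to stages (a seeded random split into equal-size stages is assumed here), and the EMA decay of 0.5 is an arbitrary placeholder.

```python
import random


def fold_curriculum(scores, num_stages=3, descending=False, seed=0):
    """Return sample indices ordered by a folded curriculum.

    Samples are randomly split into num_stages stages (assumed rule), then
    sorted by quality score inside each stage.
    """
    idx = list(range(len(scores)))
    random.Random(seed).shuffle(idx)
    stage_size = -(-len(idx) // num_stages)  # ceiling division
    order = []
    for s in range(num_stages):
        stage = idx[s * stage_size:(s + 1) * stage_size]
        stage.sort(key=lambda i: scores[i], reverse=descending)
        order.extend(stage)
    return order


def ema_of_checkpoints(checkpoints, decay=0.5):
    """Exponential moving average over checkpoints, oldest first.

    checkpoints: list of parameter vectors (lists of floats).
    decay: weight kept from the running average (assumed value).
    """
    ema = list(checkpoints[0])
    for ckpt in checkpoints[1:]:
        ema = [decay * e + (1 - decay) * c for e, c in zip(ema, ckpt)]
    return ema


# Nine samples with toy quality scores, folded into three ascending stages.
scores = [0.1 * i for i in range(9)]
order = fold_curriculum(scores)

# EMA over the last six checkpoints of a toy one-parameter model.
last_six = [[float(k)] for k in range(6)]
final_weights = ema_of_checkpoints(last_six)
```

Within each stage of `order` the scores rise, and each new stage restarts from that stage's lowest-quality samples, which spreads high-quality data across the whole LR schedule rather than concentrating it in the low-LR tail.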

Theorems & Definitions (6)

  • Theorem 6.1
  • Lemma C.1
  • proof
  • Lemma C.2
  • proof
  • proof: Proof of \ref{thm:swa loss}