How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining
Kairong Luo, Zhenbo Sun, Haodong Wen, Xinyu Shi, Jiarui Cui, Chenyi Dang, Kaifeng Lyu, Wenguang Chen
TL;DR
This work identifies a fundamental tension between ascending-quality data curricula and decaying learning rate (LR) schedules in LLM pretraining. Curriculum gains that are substantial under a constant LR largely vanish under standard LR decay, because the highest-quality data arrives late in training, when the LR is smallest. The authors propose two pragmatic remedies: a more moderate LR decay, and model averaging in place of LR decay, which they call Curriculum Model Averaging (CMA). Combining the two gives Curriculum with LR Decay and Model Averaging (CDMA). Together with a data curriculum, these strategies improve the average benchmark score by up to 1.64% over random shuffling for 1.5B-parameter models trained on 30B tokens. The results are supported by a simple theoretical model and ablations across quality metrics, datasets, and mid-training scenarios, and they argue for co-designing data curricula with training dynamics.
Abstract
Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.
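The two strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The cosine shape, the `final_ratio` parameter, the function names, and the representation of a checkpoint as a plain dict of float lists are all assumptions made for illustration; the abstract does not specify the exact schedule or the averaging weights.

```python
import math


def lr_at(step, total_steps, peak_lr, final_ratio, warmup_steps=0):
    """Cosine decay from peak_lr down to final_ratio * peak_lr.

    final_ratio is a hypothetical knob: a value near 0 mimics an aggressive
    decay, a larger value (e.g. 0.3) a more moderate decay, and 1.0 a
    constant LR after warmup.
    """
    if warmup_steps > 0 and step < warmup_steps:
        # Linear warmup up to the peak LR.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))
    floor = final_ratio * peak_lr
    return floor + 0.5 * (peak_lr - floor) * (1.0 + math.cos(math.pi * progress))


def average_checkpoints(checkpoints, weights=None):
    """Weighted average of checkpoints given as {name: [floats]} dicts.

    Weights are normalized to sum to 1; uniform weights are used by default
    (an assumption, since the abstract only says "weighted average").
    """
    if weights is None:
        weights = [1.0] * len(checkpoints)
    total = sum(weights)
    norm = [w / total for w in weights]
    averaged = {}
    for name in checkpoints[0]:
        size = len(checkpoints[0][name])
        averaged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(norm, checkpoints))
            for i in range(size)
        ]
    return averaged


if __name__ == "__main__":
    # Moderate decay: the final LR is 30% of the peak rather than near zero.
    print([round(lr_at(s, 100, 3e-4, 0.3), 6) for s in (0, 50, 100)])
    # Average the last few (toy) checkpoints instead of decaying the LR.
    last = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
    print(average_checkpoints(last))
```

Setting `final_ratio=1.0` keeps the LR constant after warmup, which corresponds to the setting where averaging the final checkpoints stands in for LR decay; intermediate values correspond to the moderate-decay setting.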
