EvoLM: In Search of Lost Language Model Training Dynamics

Zhenting Qi, Fan Nie, Alexandre Alahi, James Zou, Himabindu Lakkaraju, Yilun Du, Eric Xing, Sham Kakade, Hanlin Zhang

TL;DR

EvoLM presents a transparent, end-to-end framework to study language-model training dynamics across pre-training, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL), using 100+ decoder-only LMs at 1B and 4B parameters. By analyzing upstream and downstream performance on in-domain and out-of-domain tasks, the work reveals diminishing returns from excessive pre-training and from overlong post-training, highlights the importance of mitigating forgetting during CPT, and uncovers intricate SFT/RL trade-offs. The study introduces data replay to curtail catastrophic forgetting, demonstrates domain-specific CPT benefits when paired with appropriate SFT/RL configurations, and shows that outcome reward model (ORM) scores can serve as strong unsupervised predictors of downstream reasoning performance. All models, training data, and evaluation pipelines are released to enable reproducibility and community-driven progress in understanding training dynamics and scaling behaviors.

Abstract

Modern language model (LM) training has been divided into multiple stages, making it difficult for downstream developers to evaluate the impact of design choices made at each stage. We present EvoLM, a model suite that enables systematic and transparent analysis of LMs' training dynamics across pre-training, continued pre-training, supervised fine-tuning, and reinforcement learning. We train over 100 LMs with 1B and 4B parameters from scratch, and evaluate both upstream (language modeling) and downstream (problem-solving) capabilities, including considerations of both in-domain and out-of-domain generalization. Key insights highlight the diminishing returns from excessive pre-training and post-training, the importance and practices of mitigating forgetting during domain-specific continued pre-training, the crucial role of continued pre-training in bridging pre-training and post-training phases, and various intricate trade-offs when configuring supervised fine-tuning and reinforcement learning. To facilitate open research and reproducibility, we release all pre-trained and post-trained models, training datasets for all stages, and our entire training and evaluation pipeline.

Paper Structure

This paper contains 38 sections, 14 figures, 9 tables.

Figures (14)

  • Figure 1: Overview of EvoLM, a transparent model suite for studying language-model training dynamics across pre-training, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL). The framework evaluates both upstream (language modeling) and downstream (problem-solving) performance across in-domain (e.g., math) and out-of-domain (e.g., code, logic) settings, enabling systematic analysis of design trade-offs and scaling behaviors.
  • Figure 2: Upstream task performance vs. pre-training tokens on models {0.5B, 1B, 4B}-{10BT, 20BT, 40BT, 80BT, 160BT, 320BT}.
  • Figure 3: Downstream task performance vs. number of pre-training tokens on models: SFT: 1B-{20BT, 40BT, 80BT, 160BT, 320BT}-8+42BT-100Kep1; SFT+RL: 1B-{20BT, 40BT, 80BT, 160BT, 320BT}-8+42BT-100Kep1-100Kep8.
  • Figure 4: Upstream task performance vs. CPT tokens on models: Pretrained: 1B-160BT; CPT: 1B-160BT-8+{2BT, ..., 42BT}; CPT: 1B-160BT-0+{10BT, ..., 50BT}; CPT: 1B-160BT-{1.6+48.4BT, 16+34BT}.
  • Figure 5: Downstream task performance vs. continued pre-training tokens on models: SFT: 1B-160BT-100Kep1, 1B-160BT-8+{2BT, ..., 42BT}-100Kep1; SFT+RL: 1B-160BT-100Kep1-100Kep8, 1B-160BT-8+{2BT, ..., 42BT}-100Kep1-100Kep8.
  • ...and 9 more figures
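
The figure captions above identify models with hyphenated tags such as 1B-160BT-8+42BT-100Kep1-100Kep8. This page does not define the fields, so the following is only a sketch under an assumed reading: model size, then pre-training tokens in billions (BT), then an optional CPT segment written as replay+domain tokens, then up to two post-training segments of the form <examples>Kep<epochs>, with the first taken as SFT and the second as RL. The names `ModelTag` and `parse_tag` are hypothetical helpers for illustration and are not part of the released pipeline.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed tag layout (an interpretation of the captions, not a documented spec):
#   <size>-<pretrain>BT[-<replay>+<domain>BT][-<N>Kep<E>][-<N>Kep<E>]
_PRETRAIN = re.compile(r"^(\d+(?:\.\d+)?)BT$")
_CPT = re.compile(r"^(\d+(?:\.\d+)?)\+(\d+(?:\.\d+)?)BT$")
_POST = re.compile(r"^(\d+)Kep(\d+)$")


@dataclass
class ModelTag:
    size: str                                # e.g. "1B"
    pretrain_bt: float                       # pre-training tokens, in billions
    cpt_replay_bt: Optional[float] = None    # assumed: replayed general-data tokens
    cpt_domain_bt: Optional[float] = None    # assumed: domain-specific tokens
    sft: Optional[Tuple[int, int]] = None    # assumed: (examples, epochs)
    rl: Optional[Tuple[int, int]] = None     # assumed: (examples, epochs)


def parse_tag(tag: str) -> ModelTag:
    parts = tag.split("-")
    if len(parts) < 2:
        raise ValueError(f"expected at least size and pre-training budget: {tag!r}")
    pre = _PRETRAIN.match(parts[1])
    if pre is None:
        raise ValueError(f"unrecognized pre-training segment: {parts[1]!r}")
    out = ModelTag(size=parts[0], pretrain_bt=float(pre.group(1)))

    rest = parts[2:]
    # Optional CPT segment, e.g. "8+42BT" or "1.6+48.4BT".
    if rest:
        cpt = _CPT.match(rest[0])
        if cpt is not None:
            out.cpt_replay_bt = float(cpt.group(1))
            out.cpt_domain_bt = float(cpt.group(2))
            rest = rest[1:]

    # Remaining segments are post-training stages, e.g. "100Kep1".
    stages = []
    for segment in rest:
        post = _POST.match(segment)
        if post is None:
            raise ValueError(f"unrecognized segment: {segment!r}")
        stages.append((int(post.group(1)) * 1000, int(post.group(2))))
    if len(stages) > 2:
        raise ValueError(f"more than two post-training segments: {tag!r}")
    if stages:
        out.sft = stages[0]
    if len(stages) == 2:
        out.rl = stages[1]
    return out


if __name__ == "__main__":
    print(parse_tag("1B-160BT-8+42BT-100Kep1-100Kep8"))
```

Under this reading, a tag with no replay+domain segment (for example 1B-160BT-100Kep1) is a model that went from pre-training directly to post-training, which is how the sketch treats it.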