Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models

Jupinder Parmar; Sanjev Satheesh; Mostofa Patwary; Mohammad Shoeybi; Bryan Catanzaro

TL;DR

The paper tackles the high cost of pretraining language models by proposing a practical recipe for continued pretraining that reuses existing checkpoints. It shows that a two-stage data-distribution strategy (a general blend, GB, followed by a QA-focused blend, QB), combined with a carefully tuned cosine learning-rate schedule and a well-chosen switch point between the two distributions, yields meaningful gains over naïve continued training, and that these gains hold across token horizons from 100B to 1T tokens. Document mining amplifies the benefit further by focusing on QA-relevant data. Together, these findings give a transferable, scalable workflow for improving general capabilities without starting from scratch, potentially lowering the barrier to deploying stronger LMs.
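As a rough illustration of the cosine learning-rate schedule mentioned above, the following is a minimal sketch, not the authors' implementation. The maximum LR of 4.5e-5 comes from the Figure 3 caption; the function name `cosine_lr` and the `min_lr` default are assumptions chosen only for illustration.

```python
import math


def cosine_lr(step: int, total_steps: int,
              max_lr: float = 4.5e-5, min_lr: float = 4.5e-7) -> float:
    """Cosine decay from max_lr at step 0 down to min_lr at total_steps.

    max_lr follows the Figure 3 caption; min_lr is a hypothetical value,
    not taken from the paper.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay on the schedule.
    progress = min(max(step / total_steps, 0.0), 1.0)
    # Standard half-cosine interpolation between max_lr and min_lr.
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Varying `min_lr` changes both the LR magnitude late in training and the slope of the decay, which is the trade-off the schedules in Figure 3 explore.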

Abstract

As language models have scaled both their number of parameters and pretraining dataset sizes, the computational cost for pretraining has become intractable except for the most well-resourced teams. This increasing cost makes it ever more important to be able to reuse a model after it has completed pretraining, allowing for a model's abilities to further improve without needing to train from scratch. In this work, we detail a set of guidelines that cover how to design efficacious data distributions and learning rate schedules for continued pretraining of language models. When applying these findings within a continued pretraining run on top of a well-trained 15B parameter model, we show an improvement of 9% in average model accuracy compared to the baseline of continued training on the pretraining set. The resulting recipe provides a practical starting point with which to begin developing language models through reuse rather than retraining.

Paper Structure (26 sections, 7 figures, 20 tables)

Introduction
Related Works
Experimental Setup
Data Sources
Pretraining
Continued Pretraining
Model Architecture and Hyperparameters
Evaluation
Continued Pretraining Recipe
Experiments
Data Distribution
Learning Rate Schedule
Switch of Data Distributions
Ablations
Varying Token Horizons
...and 11 more sections

Figures (7)

Figure 1: Breakdown of the various distributions considered for the General Blend (GB). We use Upweight Non Web w/ High Quality Web as the GB moving forward given its strong performance across all evaluation areas.
Figure 2: Various distributions of QA data. We use Blend 3.
Figure 3: Cosine decay schedules with a Max LR of $4.5 \times 10^{-5}$. Each schedule prioritizes LR magnitude and slope of decay differently.
Figure 4: Breakdown of the various distributions considered for the QB. $N$e refers to $N$ epochs of the QA data. The final chosen distribution is shown as QA Blend which used 2 epochs of QA data.
Figure 5: How the number of QB tokens, the shaded region, varies based on different distribution switch points.
...and 2 more figures
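
The relationship in the Figure 5 caption, where the number of QB tokens depends on the distribution switch point, can be sketched with simple arithmetic. This is a hedged illustration that assumes the switch point is expressed as a fraction of the total token budget; the function name `split_token_budget` and the 0.75 value in the usage example are hypothetical, not values from the paper.

```python
from typing import Tuple


def split_token_budget(total_tokens: int, switch_fraction: float) -> Tuple[int, int]:
    """Split a continued-pretraining budget into (GB tokens, QB tokens).

    switch_fraction is the assumed fraction of training spent on the
    general blend (GB) before switching to the QA blend (QB).
    """
    if not 0.0 <= switch_fraction <= 1.0:
        raise ValueError("switch_fraction must be in [0, 1]")
    gb_tokens = int(total_tokens * switch_fraction)
    # Everything after the switch point is trained on the QA blend.
    qb_tokens = total_tokens - gb_tokens
    return gb_tokens, qb_tokens


# Example: a 100B-token horizon with a hypothetical switch at 75% of training.
gb, qb = split_token_budget(100_000_000_000, 0.75)
```

Under this assumption, a later switch point shrinks the QB share, which corresponds to the shaded region in Figure 5 getting smaller.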
