From Acceleration to Saturation: Scaling Behavior of Bootstrapped Language Model Pretraining

Seng Pei Liew, Takuya Kato

TL;DR

This paper investigates bootstrapped pretraining, which covers continual pretraining and model growth, and shows that the gains from second-stage training saturate when the base model is overtrained. It introduces a data-scaling framework in which the two-stage loss $L(D_1,D_2)$ follows a multiplicative form with an interaction term, $L(D_1,D_2) = A D_1^{-\alpha_1} D_2^{-\alpha_2 + \alpha_3 \log D_1} + E$, capturing how first-stage tokens ($D_1$) and second-stage tokens ($D_2$) jointly determine performance. The authors extend the law with model size to obtain a joint scaling law and quantify the practical implications, including when bootstrapping remains compute-optimal and when training from scratch is preferable. Overall, the results give a quantitative, data-driven guide for planning multi-stage pretraining and highlight the saturation risk of relying on overtrained bases in continual pretraining and model-growth workflows.
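To make the interaction term in the law above concrete, the following is a minimal Python sketch that evaluates $L(D_1,D_2)$ and the effective second-stage exponent $\alpha_2 - \alpha_3 \log D_1$. The coefficient values are hypothetical placeholders chosen only for illustration, not the paper's fitted values, and the use of the natural logarithm is an assumption, since the summary does not state the base.

```python
import math

# Hypothetical coefficients for illustration only; NOT the paper's fitted values.
HYPOTHETICAL = {"A": 1e6, "alpha1": 0.5, "alpha2": 0.5, "alpha3": 0.015, "E": 1.5}


def effective_d2_exponent(d1, alpha2, alpha3):
    """Decay exponent of the loss in D2 for a base model trained on d1 tokens.

    Assumes a natural logarithm; the base used in the paper is not stated here.
    """
    return alpha2 - alpha3 * math.log(d1)


def two_stage_loss(d1, d2, A, alpha1, alpha2, alpha3, E):
    """L(D1, D2) = A * D1^(-alpha1) * D2^(-alpha2 + alpha3 * log D1) + E."""
    exponent = -effective_d2_exponent(d1, alpha2, alpha3)
    return A * d1 ** (-alpha1) * d2 ** exponent + E


if __name__ == "__main__":
    for d1 in (1e9, 1e10, 1e11):
        exp = effective_d2_exponent(d1, HYPOTHETICAL["alpha2"], HYPOTHETICAL["alpha3"])
        loss = two_stage_loss(d1, 1e9, **HYPOTHETICAL)
        print(f"D1={d1:.0e}  effective D2 exponent={exp:.3f}  L(D1, 1e9)={loss:.3f}")
```

Under these placeholder values the effective exponent shrinks as $D_1$ grows, so doubling $D_2$ removes a smaller fraction of the reducible loss for a more heavily trained base. This is the saturation effect the summary describes.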

Abstract

Bootstrapped pretraining, i.e., the reuse of a pretrained base model for further pretraining, such as continual pretraining or model growth, is a promising way to reduce the cost of training language models from scratch. However, its effectiveness remains unclear, especially when applied to overtrained base models. In this work, we empirically study the scaling behavior of bootstrapped pretraining and find that its scaling efficiency diminishes in a predictable manner: the scaling exponent with respect to second-stage pretraining tokens decreases logarithmically with the number of tokens used to pretrain the base model. The joint dependence on first- and second-stage tokens is accurately modeled by a simple scaling law. This saturation effect reveals a fundamental trade-off in multi-stage pretraining strategies: the more extensively a model is pretrained, the less additional benefit bootstrapping provides. Our findings provide practical insights for efficient language model training and raise important considerations for the reuse of overtrained models.

Paper Structure

This paper contains 34 sections, 18 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Bootstrapped pretraining with overtrained base models leads to saturation in scaling behavior. Left: $D_2$ has power-law scaling. We show scaling behavior of second-stage training tokens ($D_2$) for different values of first-stage tokens ($D_1$). Middle: Interaction term explains decreasing exponents. The fitted exponents in the left plots are used to fit Equation \ref{eq:log} as a function of $D_1$, and are shown to agree well with the functional form. Right: Scaling factor has power-law scaling w.r.t. $D_1$. Top: Continual pretraining (CPT) on code data. Bottom: Model growth by stacking.
  • Figure 2: Illustration of bootstrapped pretraining in consideration. Bootstrapped pretraining consists of two stages: (1) first-stage pretraining of a base model for $D_1$ tokens on internet/generic data; (2) bootstrapping/second-stage pretraining via continual pretraining or model growth for $D_2$ tokens. Section \ref{sec:formulation} and \ref{sec:data} study and develop scaling laws as a function of these two variables (and additionally model size, $N$ in Section \ref{sec:closer}) to predict the final loss after the second stage.
  • Figure 3: $D_1$ has power-law scaling. We show scaling behavior of first-stage training tokens ($D_1$) for different values of second-stage tokens ($D_2$), indicating that $D_1$ also has power-law scaling. Left: Continual pretraining on code data; Right: model growth by stacking. More plots in Appendix \ref{app:scaling}.
  • Figure 4: Left: Joint scaling law fit for continual pretraining on code data. Orange points indicate the 10% of data with lowest losses used for validation. Right: Model growth efficiency decreases with sunk cost and model size. For token budgets above the curve(s), training from scratch is more efficient than stacking-based model growth; for budgets below, model growth remains advantageous. Shown are the numerical (blue) and analytical (red) solutions of Equation \ref{eq:sol}.
  • Figure 5: WSD vs cosine LR schedule. We show that the WSD LR schedule achieves similar (even slightly better) final loss compared to the more commonly used cosine LR schedule.
  • ...and 6 more figures
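
The two-step procedure described in the Figure 1 caption (fit a power law in $D_2$ for each $D_1$, then fit the resulting exponents against $D_1$) can be sketched as follows. This is an illustration on synthetic data, not the authors' code. It assumes the irreducible loss $E$ is known and already subtracted, that the exponent follows the form $\alpha_2 - \alpha_3 \log D_1$ implied by the law in the TL;DR, and that the logarithm is natural. All numeric values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth coefficients used only to generate synthetic data.
ALPHA2_TRUE, ALPHA3_TRUE = 0.5, 0.015

d1_values = np.array([1e9, 3e9, 1e10, 3e10, 1e11])  # first-stage tokens
d2_values = np.logspace(8, 10, 8)                    # second-stage tokens


def synthetic_reducible_loss(d1, d2):
    """Reducible loss (L - E) with small multiplicative noise; E assumed known."""
    scale = 1e6 * d1 ** -0.5
    exponent = ALPHA2_TRUE - ALPHA3_TRUE * np.log(d1)
    noise = np.exp(rng.normal(0.0, 0.005, size=d2.shape))
    return scale * d2 ** (-exponent) * noise


# Step 1: for each D1, fit log(L - E) = intercept - exponent * log(D2).
fitted_exponents = []
for d1 in d1_values:
    log_loss = np.log(synthetic_reducible_loss(d1, d2_values))
    slope, _ = np.polyfit(np.log(d2_values), log_loss, 1)
    fitted_exponents.append(-slope)
fitted_exponents = np.array(fitted_exponents)

# Step 2: fit exponent = alpha2 - alpha3 * log(D1) by linear regression.
slope, intercept = np.polyfit(np.log(d1_values), fitted_exponents, 1)
alpha2_hat, alpha3_hat = intercept, -slope

print("per-D1 exponents:", np.round(fitted_exponents, 4))
print(f"alpha2_hat={alpha2_hat:.3f}, alpha3_hat={alpha3_hat:.4f}")
```

A joint nonlinear fit over all $(D_1, D_2)$ pairs would also estimate $E$ rather than assume it. The two-step version is shown here only because it mirrors the panels described in the caption.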