From Acceleration to Saturation: Scaling Behavior of Bootstrapped Language Model Pretraining

Seng Pei Liew; Takuya Kato

TL;DR

This paper investigates bootstrapped pretraining, which covers continual pretraining and model growth, and shows that the gains from second-stage training saturate when the base model is overtrained. It introduces a data-scaling framework in which the two-stage loss $L(D_1,D_2)$ follows a multiplicative form with an interaction term, $L(D_1,D_2) = A D_1^{-\alpha_1} D_2^{-\alpha_2 + \alpha_3 \log D_1} + E$, capturing how first-stage tokens $D_1$ and second-stage tokens $D_2$ jointly determine performance. The authors extend the law with model size to obtain a joint scaling law and quantify practical implications, including when bootstrapping remains compute-optimal and when training from scratch is preferable. Overall, the results provide a quantitative, data-driven guide for planning multi-stage pretraining efficiently and highlight the saturation risk of relying on overtrained bases in continual pretraining and model-growth workflows.
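To make the interaction term concrete, the following is a minimal Python sketch that evaluates the two-stage law and the effective second-stage exponent it implies. The function names and every coefficient value are hypothetical placeholders chosen for illustration; they are not the paper's fitted values, and the natural logarithm is an assumption, since the summary does not state the base.

```python
import math


def two_stage_loss(d1, d2, A, alpha1, alpha2, alpha3, E):
    """Evaluate L(D1, D2) = A * D1^(-alpha1) * D2^(-alpha2 + alpha3 * log D1) + E.

    d1, d2: first- and second-stage token counts (positive numbers).
    The natural log is an assumption; the source does not state the base.
    """
    exponent = -alpha2 + alpha3 * math.log(d1)
    return A * d1 ** (-alpha1) * d2 ** exponent + E


def second_stage_exponent(d1, alpha2, alpha3):
    """Effective decay exponent with respect to D2: alpha2 - alpha3 * log D1."""
    return alpha2 - alpha3 * math.log(d1)


# Hypothetical coefficients, for illustration only (not fitted values).
COEFFS = dict(A=20.0, alpha1=0.1, alpha2=0.5, alpha3=0.01, E=1.7)

if __name__ == "__main__":
    for d1 in (1e9, 1e10, 1e11):
        slope = second_stage_exponent(d1, COEFFS["alpha2"], COEFFS["alpha3"])
        loss = two_stage_loss(d1, 1e10, **COEFFS)
        print(f"D1={d1:.0e}  effective exponent={slope:.4f}  L(D1, 1e10)={loss:.4f}")
```

With positive `alpha3`, the printed effective exponent shrinks as `d1` grows, which is the saturation behavior the summary describes: each additional second-stage token buys less when the base model has seen more first-stage tokens.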

Abstract

Bootstrapped pretraining, i.e., the reuse of a pretrained base model for further pretraining, such as continual pretraining or model growth, is a promising way to reduce the cost of training language models from scratch. However, its effectiveness remains unclear, especially when applied to overtrained base models. In this work, we empirically study the scaling behavior of bootstrapped pretraining and find that its scaling efficiency diminishes in a predictable manner: the scaling exponent with respect to second-stage pretraining tokens decreases logarithmically with the number of tokens used to pretrain the base model. The joint dependence on first- and second-stage tokens is accurately modeled by a simple scaling law. This saturation effect reveals a fundamental trade-off in multi-stage pretraining strategies: the more extensively a model is pretrained, the less additional benefit bootstrapping provides. Our findings provide practical insights for efficient language model training and raise important considerations for the reuse of overtrained models.
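
The logarithmic decay of the second-stage exponent can be read off the two-stage law quoted in the TL;DR. As a sketch, assuming the form $L(D_1,D_2) = A D_1^{-\alpha_1} D_2^{-\alpha_2 + \alpha_3 \log D_1} + E$ with $\alpha_2, \alpha_3 > 0$ (the sign assumption is ours, made for illustration), taking the logarithm of the reducible loss $L - E$ and differentiating with respect to $\log D_2$ gives

$$
\log (L - E) = \log A - \alpha_1 \log D_1 + \left(-\alpha_2 + \alpha_3 \log D_1\right) \log D_2,
\qquad
\frac{\partial \log (L - E)}{\partial \log D_2} = -\left(\alpha_2 - \alpha_3 \log D_1\right).
$$

The effective second-stage exponent is therefore $\alpha_2 - \alpha_3 \log D_1$, which falls linearly in $\log D_1$. Within this functional form it would reach zero at $\log D_1 = \alpha_2 / \alpha_3$; whether the fit remains valid that far out is not stated here.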
