CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models

Jiawei Gu; Zacc Yang; Chuanghao Ding; Rui Zhao; Fei Tan

TL;DR

This paper investigates Continual Pre-Training (CPT) for large language models by examining how to optimally mix general and domain-specific data. It formalizes a constrained optimization framework using a Lagrangian, defines feasible mixture ratios and a Critical Mixture Ratio (CMR) as the maximum feasible domain-data share under a general-loss tolerance, and demonstrates a power-law relationship linking losses, mixture ratio, and training tokens. Through extensive experiments on 460M–3.1B LLMs across Finance and Academic Papers, it shows that CMR can be predicted via a data-budget scaling law $R_{CMR} = \beta_0 + \beta_1 T^{s}$, with CMR growing with model size and domain closeness; concrete predictions are provided (e.g., 29.8%–47.8% across model sizes for $T_{max}=20$B). The work offers practical guidelines for efficiently balancing general and domain-specific knowledge during CPT, while acknowledging limitations such as computational constraints, domain scope, and the need for downstream evaluations.
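
The CMR definition in the summary above (the largest feasible domain-data share under a general-loss tolerance) can be illustrated with a minimal Python sketch. It assumes that "feasible" simply means the measured increase in general loss does not exceed the tolerance. The function name `critical_mixture_ratio`, the tolerance value, and all loss numbers are hypothetical and are not taken from the paper.

```python
def critical_mixture_ratio(gen_loss_increase, tolerance):
    """Return the largest domain-data ratio whose general-loss increase
    stays within the tolerance, or None when no ratio is feasible.

    gen_loss_increase maps a mixture ratio to the measured increase in
    general loss at that ratio (hypothetical measurements).
    """
    feasible = [r for r, delta in gen_loss_increase.items() if delta <= tolerance]
    return max(feasible) if feasible else None


# Made-up numbers purely for illustration; they are not results from the paper.
measurements = {1 / 8: 0.002, 1 / 4: 0.006, 1 / 3: 0.011, 1 / 2: 0.030}
cmr = critical_mixture_ratio(measurements, tolerance=0.012)
print(cmr)
```

With these illustrative inputs, the ratios 1/8, 1/4, and 1/3 are feasible and 1/2 is not, so the sketch selects 1/3.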

Abstract

Large Language Models (LLMs) excel in diverse tasks but often underperform in specialized fields due to limited domain-specific or proprietary corpora. Continual pre-training (CPT) enhances LLM capabilities by injecting new domain-specific or proprietary knowledge while replaying general corpora to prevent catastrophic forgetting. The mixture ratio of general to domain-specific data, however, has been chosen heuristically, leading to sub-optimal training efficiency in practice. We revisit the scaling behavior of LLMs under CPT and discover a power-law relationship between loss, mixture ratio, and the number of training tokens. We formalize the trade-off between general and domain-specific capabilities, leading to a well-defined Critical Mixture Ratio (CMR) of general and domain data. By striking this balance, the CMR maintains the model's general ability while achieving the desired domain transfer, making the best use of available resources. Balancing efficiency and effectiveness, the CMR can be regarded as the optimal mixture ratio. Through extensive experiments, we establish that the CMR is predictable, propose the CMR scaling law, and substantiate its generalization. These findings offer practical guidelines for optimizing LLM training in specialized domains, ensuring both general and domain-specific performance while efficiently managing training resources.
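
To make the idea of a predictable CMR concrete, the sketch below fits a curve of the form "offset plus coefficient times a power of the token budget" and extrapolates it to a larger budget. It is only an illustration under stated assumptions: the fitting procedure (a scan over candidate exponents with closed-form least squares for the other two parameters) is a simple stand-in rather than the authors' method, the function name `fit_offset_power_law` is hypothetical, and the data are synthetic values generated from known parameters.

```python
def fit_offset_power_law(ts, rs, exponents):
    """Fit r ≈ b0 + b1 * t**s by scanning candidate exponents s and
    solving for (b0, b1) with ordinary least squares at each s."""
    best = None
    n = len(ts)
    for s in exponents:
        xs = [t ** s for t in ts]
        mean_x = sum(xs) / n
        mean_r = sum(rs) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        if sxx == 0.0:
            continue  # degenerate exponent: every x is identical
        sxr = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, rs))
        b1 = sxr / sxx
        b0 = mean_r - b1 * mean_x
        sse = sum((b0 + b1 * x - r) ** 2 for x, r in zip(xs, rs))
        if best is None or sse < best[0]:
            best = (sse, b0, b1, s)
    if best is None:
        raise ValueError("no usable exponent among the candidates")
    _, b0, b1, s = best
    return b0, b1, s


# Synthetic data generated from known parameters (not values from the paper).
true_b0, true_b1, true_s = 0.10, 0.05, 0.5
tokens = [1.0, 2.0, 5.0, 10.0, 20.0]  # token budgets in arbitrary units
ratios = [true_b0 + true_b1 * t ** true_s for t in tokens]

candidates = [k / 100 for k in range(5, 151)]  # s from 0.05 to 1.50
b0, b1, s = fit_offset_power_law(tokens, ratios, candidates)
predicted = b0 + b1 * 40.0 ** s  # extrapolate beyond the fitted range
print(b0, b1, s, predicted)
```

Scanning the exponent keeps the example dependency-free; with real, noisy measurements a nonlinear least-squares routine would be the more usual choice.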

Paper Structure (38 sections, 33 equations, 10 figures, 5 tables)

Introduction
Key Results
Background and Methods
Continual Pre-training on Mixed Dataset
Visualization
Method
LLM Architecture
Experiment Setup
Evaluation
Does the Critical Mixture Ratio Exist?
Findings
Is CMR Predictable?
Predicting Losses of Mixture Ratio
Predicting Losses of Training Tokens
Predicting CMR
...and 23 more sections

Figures (10)

Figure 1: Follow the direction of the training trajectory to track the trend of the curve. Each bunch of lines represents a model size: $\{3.1\mathrm{B},1.6\mathrm{B},940\mathrm{M},460\mathrm{M} \}$, and each group of line colors represents the mixture ratios $\{1/8, 1/4, 1/3, 1/2\}$ from dark to light. To display the trend more clearly, we have omitted ratios greater than $1/2$. The yellow dashed lines point horizontally, indicating the ratios at which $d \mathcal{L}_{\Delta \text{gen}}/d\mathcal{L}_{\Delta \text{dom}}$ is close to $0$. The third set of lines, for model size $940\textrm{M}$, is enlarged on the right side to show the trend of the training curve more clearly. All horizontal and vertical cross-sections of the 3D diagram on the left side are detailed in the Appendix.
Figure 2: Follow the direction of the training trajectory to track the trend of the curve. The $\mathcal{L}_{\Delta \text{gen}}$ and $\mathcal{L}_{\Delta \text{dom}}$ loss functions for the models at mixture ratios of $1/4$ and $1/3$ are illustrated.
Figure 3: The upper figure shows the fitting curve of domain loss $\mathcal{L}_\text{dom}$ with the change of mixture ratio $R$, and the lower figure shows the fitting curve of general loss $\mathcal{L}_\text{gen}$. The solid circles ($\bullet$) represent real losses, and the stars (★) represent the predicted losses.
Figure 4: The fitted and extrapolated general loss of $M_{1.6B}$ at four distinct mixture ratios: $\{1/8, 1/4, 1/3, 1/2\}$. At higher ratios, the curve gradually rises as the training data volume increases.
Figure 5: The CMR scaling law is used to predict CMRs under a fixed model size $S$; the predictions are extrapolated to $T=250$, which is equivalent to a training volume of $500\mathrm{B}$ tokens.
...and 5 more figures

Theorems & Definitions (3)

Definition 1
Definition 2
Definition 3
