D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models

Haoran Que, Jiaheng Liu, Ge Zhang, Chenchen Zhang, Xingwei Qu, Yinghao Ma, Feiyu Duan, Zhiqi Bai, Jiakai Wang, Yuanxing Zhang, Xu Tan, Jie Fu, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng

TL;DR

This work introduces the Domain-specific Continual Pre-Training (D-CPT) Law, a scaling-law-based framework for predicting domain-aware performance of LLMs as a function of model size $N$, data size $D$, and mixture ratio $r$ between general and domain corpora. It proposes a preferred parameterization $L(N,D,r) = E + \frac{A}{N^{\alpha}} + \frac{B r^{\eta}}{D^{\beta}} + \frac{C}{(r+\epsilon)^{\gamma}}$ to fit in-domain and generalize across scales, plus a Cross-Domain version introducing a Domain-specific Learnable Coefficient $K$ via $L(N,D,r,K) = E + \frac{A}{N^{\alpha}} + \frac{B r^{\eta}}{D^{\beta}} + \frac{C}{(r+\epsilon)^{\gamma}} + \frac{F}{K^{\mu}}$. Extensive experiments across six downstream domains and three model sizes demonstrate strong fit (high $R^2$, low Huber loss) and robust generalization, with practical usages for trade-offs between general and domain-specific abilities, data-limited domain adaptation, and resource allocation. The Cross-Domain extension further enables efficient fitting to unseen domains by estimating a compact DLC representation. Overall, the paper provides a quantitative toolkit to predict and optimize domain-adapted CPT with reduced computational cost, offering actionable guidance for real-world deployment of domain-specific LLMs.
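The two parameterizations in the TL;DR can be evaluated directly once coefficients are available. The following is a minimal sketch, not the authors' code: the names `DCPTParams`, `d_cpt_law`, and `cross_domain_d_cpt_law` are hypothetical, and every coefficient value is made up for illustration rather than taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DCPTParams:
    """Coefficients of the D-CPT Law (the values used below are invented)."""
    E: float
    A: float
    B: float
    C: float
    alpha: float
    beta: float
    gamma: float
    eta: float
    eps: float


def d_cpt_law(n: float, d: float, r: float, p: DCPTParams) -> float:
    """L(N, D, r) = E + A/N^alpha + B*r^eta/D^beta + C/(r + eps)^gamma."""
    return (
        p.E
        + p.A / n ** p.alpha
        + p.B * r ** p.eta / d ** p.beta
        + p.C / (r + p.eps) ** p.gamma
    )


def cross_domain_d_cpt_law(n: float, d: float, r: float, k: float,
                           p: DCPTParams, f: float, mu: float) -> float:
    """L(N, D, r, K): the D-CPT Law plus the domain term F / K^mu."""
    return d_cpt_law(n, d, r, p) + f / k ** mu


# Hypothetical coefficients, NOT values reported in the paper.
params = DCPTParams(E=1.5, A=400.0, B=50.0, C=0.3,
                    alpha=0.34, beta=0.3, gamma=0.5, eta=0.5, eps=0.05)

# Predicted loss at a small and at a larger (N, D) setting, same mixture ratio.
small = d_cpt_law(n=0.5e9, d=1e9, r=0.5, p=params)
large = d_cpt_law(n=1.8e9, d=1e10, r=0.5, p=params)
print(f"small setting: {small:.4f}  large setting: {large:.4f}")
```

With positive $A$, $B$, $\alpha$, and $\beta$, as assumed here, the predicted loss falls as $N$ or $D$ grows at a fixed mixture ratio; the sign and size of the fitted coefficients in the paper may differ.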

Abstract

Continual Pre-Training (CPT) on Large Language Models (LLMs) has been widely used to expand the model's fundamental understanding of specific downstream domains (e.g., math and code). For the CPT on domain-specific LLMs, one important question is how to choose the optimal mixture ratio between the general-corpus (e.g., Dolma, SlimPajama) and the downstream domain-corpus. Existing methods usually rely on laborious grid search over a set of mixture ratios, which incurs high GPU training costs. Moreover, there is no guarantee that the selected ratio is optimal for the specific domain. To address the limitations of existing methods, inspired by the Scaling Law for performance prediction, we propose to investigate the Scaling Law of the Domain-specific Continual Pre-Training (D-CPT Law) to decide the optimal mixture ratio with acceptable training costs for LLMs of different sizes. Specifically, by fitting the D-CPT Law, we can easily predict the general and downstream performance of arbitrary mixture ratios, model sizes, and dataset sizes using small-scale training costs on limited experiments. Moreover, we also extend our standard D-CPT Law to cross-domain settings and propose the Cross-Domain D-CPT Law to predict the D-CPT Law of target domains, where very small training costs (about 1% of the normal training costs) are needed for the target domains. Comprehensive experimental results on six downstream domains demonstrate the effectiveness and generalizability of our proposed D-CPT Law and Cross-Domain D-CPT Law.
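Once a law has been fitted, choosing a mixture ratio becomes a scan over predicted losses instead of a grid of training runs. The sketch below illustrates that idea under several assumptions: separate coefficient sets are fitted for the general-corpus loss (as a function of $r_g$) and the domain-corpus loss (as a function of $r_d = 1 - r_g$), the selection rule (lowest predicted domain loss subject to a cap on predicted general loss) is just one hypothetical way to express the trade-off, and all names and coefficient values are invented for illustration.

```python
def d_cpt_loss(n, d, r, p):
    """L(N, D, r) = E + A/N^alpha + B*r^eta/D^beta + C/(r + eps)^gamma."""
    return (
        p["E"]
        + p["A"] / n ** p["alpha"]
        + p["B"] * r ** p["eta"] / d ** p["beta"]
        + p["C"] / (r + p["eps"]) ** p["gamma"]
    )


def pick_domain_ratio(n, d, general_params, domain_params,
                      max_general_loss, step=0.01):
    """Scan r_d over (0, 1) and return (r_d, L_g, L_d) with the lowest
    predicted domain loss among ratios whose predicted general loss stays
    at or below max_general_loss. Returns None if no ratio qualifies."""
    best = None
    for i in range(1, int(round(1 / step))):
        r_d = i * step
        r_g = 1.0 - r_d  # the two ratios sum to one
        l_g = d_cpt_loss(n, d, r_g, general_params)
        l_d = d_cpt_loss(n, d, r_d, domain_params)
        if l_g > max_general_loss:
            continue  # this mixture would hurt general ability too much
        if best is None or l_d < best[2]:
            best = (r_d, l_g, l_d)
    return best


# Hypothetical fitted coefficients, NOT values reported in the paper.
GENERAL = dict(E=1.5, A=400.0, alpha=0.34, B=50.0, beta=0.3,
               eta=0.5, C=0.3, gamma=0.5, eps=0.05)
DOMAIN = dict(E=0.8, A=300.0, alpha=0.34, B=50.0, beta=0.3,
              eta=0.5, C=0.25, gamma=0.5, eps=0.05)

choice = pick_domain_ratio(n=1.8e9, d=1e10, general_params=GENERAL,
                           domain_params=DOMAIN, max_general_loss=2.25)
if choice is not None:
    r_d, l_g, l_d = choice
    print(f"r_d={r_d:.2f}  predicted L_g={l_g:.4f}  predicted L_d={l_d:.4f}")
```

Each candidate ratio costs only a function evaluation here, which is the practical point of fitting the law on small-scale experiments first.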

Paper Structure (59 sections, 31 equations, 17 figures, 21 tables)

This paper contains 59 sections, 31 equations, 17 figures, 21 tables.

Figures (17)

  • Figure 1: Illustration of the performance of D-CPT Law. (Left): The curves show the relationship between $L_g$ and $r_g$ under different dataset sizes $D$ for the Qwen1.5-1.8B model. CPT data are a mixture of code-corpus and general-corpus. Here, $L_g$ represents the loss on the general-corpus validation set, while $r_g$ indicates the percentage of the general corpus in the training data. The dashed curves are those predicted by D-CPT Law; circular markers and star markers denote fitting data points and unseen validation points, respectively. (Right): The corresponding curves relating the code-corpus validation loss $L_d$ to the percentage of the code-corpus data $r_d$.
  • Figure 2: Illustration of the D-CPT Law and Cross-Domain D-CPT Law pipeline. (Upper): In D-CPT Law, we first collect domain-corpus and general-corpus, and conduct experiments under a small-scale experimental setup to gather empirical data points to fit the D-CPT Law. After that, we can predict the model's performance in large-scale experimental settings. (Lower): In Cross-Domain D-CPT Law, for an unseen downstream domain such as Physics, we can calculate its Domain-specific Learnable Coefficient value and incorporate it into the fitted Cross-Domain D-CPT Law to derive the D-CPT Law for this new domain. Based on the D-CPT Law, we introduce three application scenarios in Section \ref{usage-dcptlaw}: optimal mixture on the trade-off between general and domain-specific abilities, optimal mixture for limited domain-specific data, and resource allocation.
  • Figure 3: Effectiveness of D-CPT Law ($L_3$). (left two): General-corpus validation loss $L_g$ with respect to dataset size $D$ across different model sizes $N$, domain-corpus is code and general-corpus mixture ratio $r_g=0.5$. (right two): Domain-corpus validation loss $L_d$ with respect to dataset size $D$ across different model sizes $N$, domain-corpus is code and domain-corpus mixture ratio $r_d=0.5$.
  • Figure 4: $L_g$ with respect to $D$, domain-corpus is code, $r_g=0.2$, $N=7B$.
  • Figure 5: Effectiveness of Cross-Domain D-CPT Law ($K_3$). (left two): $L_g$ with respect to dataset size $D$ across different model sizes $N$, domain-corpus is music and $r_g=0.2$. (right two): $L_d$ with respect to dataset size $D$ across different model sizes $N$, domain-corpus is music and $r_d=0.8$.
  • ...and 12 more figures
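
The figure captions compare predicted curves against fitting points and unseen validation points, and the TL;DR reports fit quality as $R^2$ and Huber loss. A minimal sketch of how those two metrics could be computed for a set of observed and predicted losses follows. The function names are hypothetical, the Huber threshold `delta` is an assumed default rather than the paper's setting, and the sample numbers are invented.

```python
def huber(residual: float, delta: float = 1.0) -> float:
    """Huber loss: quadratic for small residuals, linear beyond delta."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def mean_huber(observed, predicted, delta: float = 1.0) -> float:
    """Average Huber loss over paired observed and predicted values."""
    pairs = list(zip(observed, predicted))
    return sum(huber(o - p, delta) for o, p in pairs) / len(pairs)


def r_squared(observed, predicted) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Invented validation losses and law predictions, for illustration only.
observed = [2.31, 2.24, 2.19, 2.15]
predicted = [2.30, 2.25, 2.18, 2.16]
print(f"R^2 = {r_squared(observed, predicted):.4f}")
print(f"mean Huber = {mean_huber(observed, predicted):.6f}")
```

Scoring the unseen validation points separately from the fitting points is what distinguishes generalization from goodness of fit.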