Towards Effective and Efficient Continual Pre-training of Large Language Models
Jie Chen, Zhipeng Chen, Jiapeng Wang, Kun Zhou, Yutao Zhu, Jinhao Jiang, Yingqian Min, Wayne Xin Zhao, Zhicheng Dou, Jiaxin Mao, Yankai Lin, Ruihua Song, Jun Xu, Xu Chen, Rui Yan, Zhewei Wei, Di Hu, Wenbing Huang, Ji-Rong Wen
TL;DR
This work addresses two weaknesses of Llama-3 (8B), Chinese language ability and multidisciplinary scientific reasoning, while mitigating catastrophic forgetting during continual pre-training (CPT). The authors propose a two-stage CPT procedure. The first stage, bilingual adaptation, uses a topic-based data mixture and a perplexity-based curriculum to strengthen Chinese capabilities while preserving English performance. The second stage, synthetic enhancement, adds large-scale synthetic science and coding data, giving a 1:7:2 mix of Chinese:English:synthetic data in that stage. The synthetic data consist of scientific QA pairs spanning nine disciplines, derived from related web pages, together with code QA pairs. The approach is tuned on TinyLlama and then applied to Llama-3, where about 100B tokens of training data yield gains on Chinese benchmarks (C-Eval +8.81, CMMLU +6.31) and on scientific reasoning benchmarks (MATH +12.00, SciEval +4.13). The results indicate that careful data curation (a topic-aware mixture and a curriculum) combined with synthetic multidisciplinary data can improve targeted abilities without sacrificing existing ones. The authors release the model, data, and code for reproducibility.
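To make the 1:7:2 mixture concrete, the following is a minimal sketch of how a token budget could be split across the three sources. It is not the authors' code: the function name `allocate_tokens`, the source labels, and the 10B-token budget are illustrative assumptions, and only the 1:7:2 ratio comes from the summary above.

```python
from fractions import Fraction

# Chinese:English:synthetic ratio reported for the synthetic enhancement stage.
STAGE2_RATIO = {"chinese": 1, "english": 7, "synthetic": 2}


def allocate_tokens(total_tokens: int, ratio: dict[str, int]) -> dict[str, int]:
    """Split an integer token budget across sources in proportion to `ratio`.

    Uses the largest-remainder method so the parts always sum to the budget.
    """
    weight_sum = sum(ratio.values())
    # Exact (fractional) share of each source.
    exact = {name: Fraction(total_tokens * w, weight_sum) for name, w in ratio.items()}
    # Start from the floor of each share (shares are non-negative).
    alloc = {name: int(share) for name, share in exact.items()}
    leftover = total_tokens - sum(alloc.values())
    # Hand out the remaining tokens to the largest fractional remainders.
    by_remainder = sorted(ratio, key=lambda name: exact[name] - alloc[name], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


if __name__ == "__main__":
    # Hypothetical budget, chosen only for illustration.
    for source, tokens in allocate_tokens(10_000_000_000, STAGE2_RATIO).items():
        print(f"{source}: {tokens:,} tokens")
```

Exact rational arithmetic avoids rounding drift, so the per-source counts always add up to the requested budget even when it is not divisible by the ratio sum.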
Abstract
Continual pre-training (CPT) has been an important approach for adapting language models to specific domains or tasks. To make the CPT approach more traceable, this paper presents a technical report on continually pre-training Llama-3 (8B), which significantly enhances the Chinese language ability and scientific reasoning ability of the backbone model. To enhance the new abilities while retaining the original ones, we design specific data mixture and curriculum strategies that use existing datasets and newly synthesized high-quality datasets. Specifically, we synthesize multidisciplinary scientific question-answer (QA) pairs based on related web pages, and subsequently incorporate these synthetic data to improve the scientific reasoning ability of Llama-3. We refer to the model after CPT as Llama-3-SynE (Synthetic data Enhanced Llama-3). We also present tuning experiments with a relatively small model, TinyLlama, and apply the derived findings to train the backbone model. Extensive experiments on a number of evaluation benchmarks show that our approach substantially improves the performance of the backbone model, including both general abilities (+8.81 on C-Eval and +6.31 on CMMLU) and scientific reasoning abilities (+12.00 on MATH and +4.13 on SciEval), without hurting the original capabilities. Our model, data, and code are available at https://github.com/RUC-GSAI/Llama-3-SynE.
