Efficient Continual Pre-training for Building Domain Specific Large Language Models
Yong Xie, Karan Aggarwal, Aitzaz Ahmad
TL;DR
The paper tackles the high cost of building domain-specific LLMs by proposing domain-adaptive continual pre-training (DACP) and two efficient data-selection strategies. FinPythia-6.9B, trained on a large financial corpus, demonstrates notable improvements on financial tasks with only a fraction of the original data, and the proposed ETS-DACP and ETA-DACP methods further reduce cost while maintaining open-domain capabilities. The authors show that careful data curation and sampling (based on task similarity, novelty, and diversity) can yield superior in-domain performance (up to ~8% average gains) with as little as 10% of the data. This work provides a practical, cost-effective path for building domain-specific LLMs and broadens understanding of the role of data selection in continual pre-training.
Abstract
Large language models (LLMs) have demonstrated remarkable open-domain capabilities. Traditionally, LLMs tailored for a domain are trained from scratch to excel at handling domain-specific tasks. In this work, we explore an alternative strategy of continual pre-training as a means to develop domain-specific LLMs. We introduce FinPythia-6.9B, developed through domain-adaptive continual pre-training on the financial domain. The continually pre-trained FinPythia shows consistent improvements on financial tasks over the original foundation model. We further explore simple but effective data selection strategies for continual pre-training. Our data selection strategies outperform vanilla continual pre-training with just 10% of the corpus size and cost, without any degradation on standard open-domain tasks. Our work proposes a cost-effective alternative to building domain-specific LLMs from scratch.
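
To make the idea of similarity-based data selection concrete, the following is a minimal sketch, not the authors' implementation. It assumes each document and each task example has already been mapped to an embedding vector by some encoder (not shown), and it keeps the fraction of documents whose embeddings are closest, by cosine similarity, to the centroid of the task embeddings. The function names, the centroid scoring rule, and the toy vectors are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_top_fraction(doc_embeddings, task_embeddings, fraction=0.10):
    """Return indices of the documents most similar to the task centroid.

    Hypothetical helper: scoring against a single centroid is an assumption
    made for this sketch, not a rule taken from the paper.
    """
    dim = len(task_embeddings[0])
    # Mean of the task embeddings, computed per dimension.
    centroid = [
        sum(e[i] for e in task_embeddings) / len(task_embeddings)
        for i in range(dim)
    ]
    scores = [cosine(d, centroid) for d in doc_embeddings]
    # Keep at least one document, otherwise the requested fraction (rounded down).
    k = max(1, int(len(doc_embeddings) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy 2-D embeddings standing in for a real corpus and real task data.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0],
        [0.1, 0.9], [0.7, 0.3], [0.2, 0.8], [0.0, -1.0], [0.95, 0.05]]
task = [[1.0, 0.0], [0.8, 0.2]]

selected = select_top_fraction(docs, task, fraction=0.10)
print(selected)
```

With ten toy documents and `fraction=0.10`, a single document is kept. A real pipeline would score a far larger corpus, and could swap the similarity score for a novelty or diversity measure when no task data is available.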
