Efficient Continual Pre-training for Building Domain Specific Large Language Models

Yong Xie; Karan Aggarwal; Aitzaz Ahmad

TL;DR

The paper tackles the high cost of building domain-specific LLMs by proposing domain-adaptive continual pre-training (DACP) and two efficient data-selection strategies. FinPythia-6.9B, trained on a large financial corpus, demonstrates notable improvements on financial tasks with only a fraction of the original data, and the proposed ETS-DACP and ETA-DACP methods further reduce cost while maintaining open-domain capabilities. The authors show that careful data curation and sampling, based on task similarity, novelty, and diversity, can yield superior in-domain performance (up to ~8% average gains) with as little as 10% of the data. This work provides a practical, cost-effective path for building domain-specific LLMs and broadens understanding of the role of data selection in continual pre-training.

Abstract

Large language models (LLMs) have demonstrated remarkable open-domain capabilities. Traditionally, LLMs tailored for a domain are trained from scratch to excel at handling domain-specific tasks. In this work, we explore an alternative strategy of continual pre-training as a means to develop domain-specific LLMs. We introduce FinPythia-6.9B, developed through domain-adaptive continual pre-training on the financial domain. The continually pre-trained FinPythia shows consistent improvements on financial tasks over the original foundation model. We further explore simple but effective data selection strategies for continual pre-training. Our data selection strategies outperform vanilla continual pre-training with just 10% of the corpus size and cost, without any degradation on standard open-domain tasks. Our work proposes a cost-effective alternative to building domain-specific LLMs from scratch.
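
The abstract does not say how the 10% subset is drawn. The following is a minimal sketch of the two sampling modes named in the paper outline (hard and soft sampling), assuming each document already carries a non-negative selection score, for example a similarity, perplexity, or entropy value. The function names `hard_sample` and `soft_sample`, the `fraction` and `seed` parameters, and the weighted-key method used for soft sampling are illustrative assumptions, not the authors' implementation.

```python
import random


def hard_sample(scores, fraction):
    """Deterministically keep the top `fraction` of documents by score."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def soft_sample(scores, fraction, seed=0):
    """Draw documents without replacement, favouring higher scores.

    Assumption: uses the Efraimidis-Spirakis weighted-key trick
    (key = u ** (1 / weight), keep the k largest keys). Documents with
    a score of zero or less are never selected.
    """
    rng = random.Random(seed)
    k = max(1, round(len(scores) * fraction))
    keys = []
    for i, weight in enumerate(scores):
        if weight <= 0:
            continue
        keys.append((rng.random() ** (1.0 / weight), i))
    keys.sort(reverse=True)
    return sorted(i for _, i in keys[:k])


# Hypothetical scores for ten documents; select 20% of them.
scores = [0.1, 0.9, 0.5, 0.7, 0.2, 0.3, 0.8, 0.4, 0.6, 0.05]
hard_idx = hard_sample(scores, 0.2)
soft_idx = soft_sample(scores, 0.2, seed=1)
```

Hard selection always returns the same highest-scoring documents, while soft selection still admits lower-scoring documents occasionally, which may preserve some variety in the selected subset.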

Paper Structure (31 sections, 4 equations, 7 figures, 5 tables)

Introduction
Methodology
Financial Corpus Curation
Domain-adaptive Continual Pre-training (DACP)
Task-Adaptive Continual Pre-training (TACP)
Towards an Efficient Domain-adaptive Continual Pre-training
Formulation
Efficient Task-Similar Domain-adaptive Continual Pre-training
Efficient Task-Agnostic Domain-adaptive Continual Pre-training
Data Sampling Strategy
Hard Sampling
Soft Sampling
Experimental Setup
Evaluation tasks
Training Setup and Infrastructure
...and 16 more sections

Figures (7)

Figure 1: Labeled task data, task-similar domain data and domain corpus in a manifold space.
Figure 2: Training loss of FinPythia-6.9B. FinPythia-6.9B achieves a significant loss drop on the financial corpus at the mild expense of a higher Pile loss.
Figure 3: Distribution of perplexity, similarity and diversity.
Figure 4: Spearman's rank correlation heatmap between perplexity, similarity, and entropy measures.
Figure 5: Average sample quantile of subsets of financial corpus used in ETS-DACP-com and ETS-DACP.
...and 2 more figures
