Investigating Continual Pretraining in Large Language Models: Insights and Implications

Çağatay Yıldız; Nishaanth Kanna Ravichandran; Nitin Sharma; Matthias Bethge; Beyza Ermis

TL;DR

This work investigates continual domain-adaptive pretraining for large language models using a large-scale, multi-domain benchmark (M2D2) spanning 236 domains. By pretraining models sequentially on domain corpora and evaluating them with perplexity-based metrics and transfer analyses, the authors reveal scaling laws and nuanced effects of training order on forgetting and knowledge transfer. Key findings show that continual pretraining benefits GPT2-family models and scales with model size, while Llama2-7B often fails to improve because the individual domains are too small. Random domain ordering generally enhances retention and forward transfer, whereas similar-order training fosters domain specialization. The study provides a realistic benchmark and actionable insights for deploying continual pretraining in diverse domains, highlighting architecture-dependent outcomes and the trade-offs between data size, order, and transfer.

Abstract

Continual learning (CL) in large language models (LLMs) is an evolving domain that focuses on developing efficient and sustainable training strategies to adapt models to emerging knowledge and achieve robustness in dynamic environments. Our primary emphasis is on continual domain-adaptive pretraining, a process designed to equip LLMs with the ability to integrate new information from various domains while retaining previously learned knowledge. Since existing works concentrate mostly on continual fine-tuning for a limited selection of downstream tasks or training domains, we introduce a new benchmark designed to measure the adaptability of LLMs to changing pretraining data landscapes. We further examine the impact of model size on learning efficacy and forgetting, as well as how the progression and similarity of emerging domains affect the knowledge transfer within these models. Our findings uncover several key insights: (i) continual pretraining consistently improves the <1.5B-parameter models studied in this work and is also superior to domain adaptation, (ii) larger models always achieve better perplexity than smaller ones when continually pretrained on the same corpus, (iii) smaller models are particularly sensitive to continual pretraining, showing the most significant rates of both learning and forgetting, (iv) continual pretraining boosts downstream task performance of the GPT-2 family, (v) continual pretraining enables LLMs to specialize better when the sequence of domains shows semantic similarity, while randomizing training domains leads to better transfer and final performance otherwise. We posit that our research establishes a new benchmark for CL in LLMs, providing a more realistic evaluation of knowledge retention and transfer across diverse domains.
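
The contrast between similarity-based and randomized domain sequences in finding (v) can be made concrete with a small sketch. This is only an illustration under stated assumptions: the greedy nearest-neighbour rule, the function names `similar_order` and `random_order`, and the toy 2-D embeddings are hypothetical and are not taken from the paper, which may construct its orderings differently.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similar_order(embeddings, start):
    """Assumed greedy rule: always move to the most similar unvisited domain."""
    order = [start]
    remaining = set(embeddings) - {start}
    while remaining:
        last = embeddings[order[-1]]
        # sorted() makes tie-breaking deterministic.
        nxt = max(sorted(remaining), key=lambda d: cosine(last, embeddings[d]))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def random_order(embeddings, seed=0):
    """Shuffle the domain names with a fixed seed for reproducibility."""
    order = sorted(embeddings)
    random.Random(seed).shuffle(order)
    return order


# Toy 2-D domain embeddings, made up purely for illustration.
toy = {
    "wiki_history": [1.0, 0.1],
    "wiki_culture": [0.9, 0.2],
    "s2orc_physics": [0.1, 1.0],
    "s2orc_math": [0.2, 0.9],
}

print(similar_order(toy, start="wiki_history"))
print(random_order(toy))
```

In the similar-order sequence, neighbouring training domains stay close in embedding space, so the model sees one portion (for example, Wiki) before switching to the other; the random sequence interleaves them.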

Paper Structure (42 sections, 10 figures, 1 table)

Introduction
Methodology
Training
Tasks
Evaluation
Experimental Setup
Models and training
Task ordering
Metrics for assessing continual learning efficacy
Findings and Analysis
What is the efficacy of continual learning?
Continual pretraining consistently improves GPT2 family
Llama2-7B perplexity does not improve by CL since the domains are too small
How does model scale impact learning, forgetting, and final performance?
Larger models always perform the best
...and 27 more sections

Figures (10)

Figure 1: Cosine similarity between our L1 training domains. We also include OpenWebText (Gokaslan and Cohen, 2019), an open-source replication of the GPT2 pretraining data set. The two big square blocks along the diagonal correspond to the Wiki and S2ORC portions.
Figure 2: Average L1-domain embeddings visualized using t-SNE. Wiki domains and natural sciences form two clear clusters. Note that Art and Philosophy are from the S2ORC portion, but they lie closer to Wiki because they are social sciences, whereas the rest of S2ORC is natural sciences.
Figure 3: The panels above show test perplexities ($\downarrow$) with different model sizes and training orders. For reference, we include the zero-shot and domain adaptation perplexities. Please see Figure \ref{tab:main-res-supp} for results obtained on Wiki and S2ORC domains.
Figure 4: Median backward transfer perplexity during continual pretraining. The grey background highlights Wiki domains. Larger models always achieve better perplexity, which is also aligned with the initial (zero-shot) perplexity. Please see Section \ref{sec:scale} for a detailed analysis.
Figure 5: Median improvement in the backward transfer perplexity, normalized by zero-shot perplexity. The grey background highlights Wiki domains. Values smaller than zero indicate negative backward transfer. The backward transfer is at its lowest when the portions are switched in similar-order training (left), then it improves again. Positive backward transfer remains throughout learning in random-order training.
...and 5 more figures
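
The quantity plotted in Figure 5 can be read as a relative perplexity gain over the zero-shot model. The sketch below is one plausible reading of that caption, not the paper's exact definition: the formula `(zero_shot - current) / zero_shot`, the function names, and the perplexity values are all assumptions made for illustration.

```python
from statistics import median


def normalized_improvement(zero_shot_ppl, current_ppl):
    """Relative perplexity improvement over zero-shot; negative means worse."""
    return (zero_shot_ppl - current_ppl) / zero_shot_ppl


def median_backward_transfer(zero_shot, current, seen_domains):
    """Median normalized improvement over previously seen domains."""
    return median(
        normalized_improvement(zero_shot[d], current[d]) for d in seen_domains
    )


# Made-up perplexities for three previously seen domains.
zero_shot = {"a": 20.0, "b": 40.0, "c": 10.0}
current = {"a": 15.0, "b": 30.0, "c": 12.0}

score = median_backward_transfer(zero_shot, current, ["a", "b", "c"])
print(f"median normalized improvement: {score:.2f}")
```

Under this reading, domain `c` has negative backward transfer (its perplexity rose above the zero-shot value), while the median across the three domains remains positive.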
