Continual Pre-training of Language Models

Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, Bing Liu

TL;DR

The paper tackles continual domain-adaptive pre-training (DAP-training) of language models: adapting an LM to a sequence of evolving domain corpora without forgetting previously learned knowledge. It introduces DAS, a soft-masking framework that preserves general and past-domain knowledge; a KL-divergence-based proxy for identifying the units that carry general knowledge; and a contrastive objective for integrating knowledge across domains. The approach delivers strong end-task performance and knowledge transfer across domains, and it does not require the domain ID during end-task fine-tuning. This makes continual DAP-training practical on streaming or private-domain data, reducing catastrophic forgetting while allowing domains to benefit from one another.

Abstract

Language models (LMs) have been instrumental for the rapid advance of natural language processing. This paper studies continual pre-training of LMs, in particular, continual domain-adaptive pre-training (or continual DAP-training). Existing research has shown that further pre-training an LM using a domain corpus to adapt the LM to the domain can improve the end-task performance in the domain. This paper proposes a novel method to continually DAP-train an LM with a sequence of unlabeled domain corpora to adapt the LM to these domains to improve their end-task performances. The key novelty of our method is a soft-masking mechanism that directly controls the update to the LM. A novel proxy is also proposed to preserve the general knowledge in the original LM. Additionally, it contrasts the representations of the previously learned domain knowledge (including the general knowledge in the pre-trained LM) and the knowledge from the current full network to achieve knowledge integration. The method not only overcomes catastrophic forgetting, but also achieves knowledge transfer to improve end-task performances. Empirical evaluation demonstrates the effectiveness of the proposed method.

Paper Structure

This paper contains 15 sections, 8 equations, 1 figure, and 7 tables.

Figures (1)

  • Figure 1: Illustration of DAS. The red cross indicates that the gradient is not used to update the Transformer but only to compute importance. (A) Initialization (Sec. \ref{sec:initialization}) computes the importance of units for the general knowledge in the LM. (B) Domain Training (Sec. \ref{sec:training}) trains a new domain using the importance scores as soft-masks and contrasts the previously learned knowledge and the full knowledge. (C) Importance Computation (Sec. \ref{sec:after_training}) computes the importance of the units for the current domain.
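
The soft-masking idea in the caption above can be illustrated with a small sketch. It is not taken from the paper's code. It assumes that each unit has an importance score in [0, 1], that scores from earlier stages are combined with an element-wise maximum, and that a unit's gradient is scaled by (1 - importance) during domain training. The function names and the max-based normalization are hypothetical choices made for illustration only.

```python
def normalize_importance(raw_scores):
    """Map raw scores to [0, 1] by dividing absolute values by their maximum.

    Assumption: this normalization is chosen for simplicity and may differ
    from the one used in the paper.
    """
    magnitudes = [abs(s) for s in raw_scores]
    peak = max(magnitudes)
    if peak == 0:
        return [0.0 for _ in magnitudes]
    return [m / peak for m in magnitudes]


def accumulate_importance(previous, current):
    """Combine importance from earlier stages with the current one.

    Assumption: element-wise maximum, so a unit stays protected once any
    earlier stage marked it as important.
    """
    return [max(p, c) for p, c in zip(previous, current)]


def soft_mask_gradient(gradient, importance):
    """Scale each unit's gradient by (1 - importance).

    Importance 1.0 blocks the update entirely; importance 0.0 leaves it
    unchanged; values in between shrink it proportionally.
    """
    return [g * (1.0 - i) for g, i in zip(gradient, importance)]


# (A) Importance for general knowledge, from hypothetical raw scores.
general = normalize_importance([0.2, 4.0, 1.0])
# (C) Importance for a first domain, computed after training on it.
domain_1 = normalize_importance([3.0, 0.0, 1.5])
# Accumulated importance that guards training on the next domain.
accumulated = accumulate_importance(general, domain_1)
# (B) Soft-mask a hypothetical gradient while training the next domain.
masked = soft_mask_gradient([0.5, -0.2, 0.8], accumulated)
```

In this toy run the first two units are fully protected, so their gradients are zeroed, while the third unit keeps half of its gradient. The mask is soft because it scales updates instead of freezing parameters outright.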