G-MAP: General Memory-Augmented Pre-trained Language Model for Domain Tasks

Zhongwei Wan; Yichun Yin; Wei Zhang; Jiaxin Shi; Lifeng Shang; Guangyong Chen; Xin Jiang; Qun Liu

TL;DR

Domain-adaptive pre-training (DAPT) often causes catastrophic forgetting of general knowledge. G-MAP introduces a memory-augmented transformer layer that fuses a memory cache $M$ built from a frozen general PLM into a domain-specific PLM, using memory-attention and several fusion strategies, notably chunk-based gated memory transfer. Across eight text-classification tasks plus QA and NER in the biomedical, CS, news, and reviews domains, G-MAP achieves state-of-the-art results, and both the frozen memory and memory-attention prove crucial for performance and efficiency. The approach suggests that reusing otherwise forgotten general knowledge can improve domain generalization and practical deployment, with potential applicability during pre-training as well as fine-tuning.

Abstract

Recently, domain-specific PLMs have been proposed to boost task performance in specific domains (e.g., biomedical and computer science) by continuing to pre-train general PLMs on domain-specific corpora. However, this Domain-Adaptive Pre-Training (DAPT; Gururangan et al. (2020)) tends to forget the general knowledge previously acquired by general PLMs, which leads to catastrophic forgetting and sub-optimal performance. To alleviate this problem, we propose a new framework, the General Memory Augmented Pre-trained Language Model (G-MAP), which augments the domain-specific PLM with a memory representation built from the frozen general PLM without losing any general knowledge. Specifically, we propose a new memory-augmented layer and, based on it, explore different memory-augmented strategies to build the memory representation and then adaptively fuse it into the domain-specific PLM. We demonstrate the effectiveness of G-MAP on various domains (biomedical and computer science publications, news, and reviews) and task types (text classification, QA, NER), and the extensive results show that the proposed G-MAP can achieve SOTA results on all tasks.
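To make the idea of fusing a frozen memory into the domain-specific PLM more concrete, the following is a minimal sketch of a memory-attention step. It is an illustration under stated assumptions, not the paper's implementation: the function name `memory_attention`, the use of identity query/key/value projections, and the plain residual fusion are all hypothetical simplifications, and the chunk-based gating described in the paper is not modeled here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_attention(hidden, memory):
    """Fuse a memory cache into domain-PLM hidden states (illustrative only).

    hidden: n x d list of token states from the domain-specific PLM.
    memory: m x d list of states produced by the frozen general PLM;
            it is only read here, never updated.
    Assumption: queries come from `hidden`, keys and values from `memory`,
    with identity projections instead of learned weight matrices.
    """
    dim = len(hidden[0])
    scale = math.sqrt(dim)
    fused = []
    for token in hidden:
        # Scaled dot-product score of this token against every memory slot.
        scores = [sum(t * m for t, m in zip(token, slot)) / scale for slot in memory]
        weights = softmax(scores)
        # Attention-weighted read from the memory cache.
        read = [sum(w * slot[j] for w, slot in zip(weights, memory)) for j in range(dim)]
        # Assumed residual fusion: add the memory read to the token state.
        fused.append([t + r for t, r in zip(token, read)])
    return fused


domain_states = [[1.0, 0.0], [0.0, 1.0]]   # toy domain-PLM hidden states
general_memory = [[0.5, -0.5]]             # toy single-slot memory cache
print(memory_attention(domain_states, general_memory))
```

Because the memory is read-only in this sketch, the general PLM's states could be computed once per input and cached, which is one plausible reading of why a frozen memory helps efficiency.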

Paper Structure (21 sections, 5 equations, 6 figures, 10 tables)

Introduction
The Method of G-MAP
Overview
Memory-Augmented Layer
Memory-Augmented Strategies
Experiments
Datasets and Metrics
Baselines
Implementation
Results and Analysis
Further Discussion
Effectiveness of Frozen Memory
Effectiveness of Memory-Attention
Layer Selection for Memory-Attention
Apply G-MAP in the Pre-training Stage
...and 6 more sections

Figures (6)

Figure 1: Masked LM (MLM) loss of RoBERTa on 50K randomly sampled documents from each domain before and after DAPT. Panels A and B show the inference loss of general RoBERTa-base and the domain-specific PLMs on biomedical (BM) and computer science (CS) samples. Panel C shows the loss of these models on samples from the pre-training (PT) corpus of RoBERTa. We report the results of Gururangan et al. (2020); lower MLM loss is better.
Figure 2: A framework of G-MAP with the cs-domain task input. PLM-G denotes the frozen general PLM, PLM-D denotes the domain-specific PLM.
Figure 3: Memory-augmented strategies of the G-MAP framework. We take a 6-layer model as an example.
Figure 4: Performance of different layer selections in the chunk-based gated memory transfer strategy.
Figure 5: Masked LM loss for the pre-training stage (a lower value is better). PT denotes samples similar to RoBERTa's pre-training corpus. DAPT(BM) denotes the domain-specific PLM for the biomedical domain. G-MAP(BM) denotes the G-MAP framework with the biomedical-domain backbone. For instance, panel A shows the models being further pre-trained on the biomedical pre-training samples and then evaluated for MLM loss on the test samples.
...and 1 more figures
