Investigating Forgetting in Pre-Trained Representations Through Continual Learning

Yun Luo, Zhen Yang, Xuefeng Bai, Fandong Meng, Jie Zhou, Yue Zhang

TL;DR

The study investigates how continual fine-tuning of pre-trained language representations erodes general knowledge. It introduces three probing-based metrics (GD, SynF, and SemF) to quantify overall, syntactic, and semantic forgetting across task sequences and model families. Empirical results show pervasive generality destruction: forgetting correlates across syntactic and semantic probes and is highly sensitive to task order. Training on general linguistic tasks first and using the hybrid DERPP approach both mitigate forgetting and preserve more general knowledge. The work provides a diagnostic and prescriptive framework for mitigating knowledge forgetting in continual learning and informs strategies for incremental knowledge injection into pre-trained representations.

Abstract

Representation forgetting refers to the drift of contextualized representations during continual training. Intuitively, representation forgetting can influence the general knowledge stored in pre-trained language models (LMs), but the concrete effect is still unclear. In this paper, we study the effect of representation forgetting on the generality of pre-trained language models, i.e., their potential capability for tackling future downstream tasks. Specifically, we design three metrics, overall generality destruction (GD), syntactic knowledge forgetting (SynF), and semantic knowledge forgetting (SemF), to measure the evolution of general knowledge in continual learning. With extensive experiments, we find that generality is destroyed in various pre-trained LMs, and that syntactic and semantic knowledge is forgotten through continual learning. Based on our experiments and analysis, we further derive two insights into alleviating general knowledge forgetting: 1) training on general linguistic tasks first can mitigate general knowledge forgetting; 2) the hybrid continual learning method can mitigate generality destruction and maintain more general knowledge than methods that consider only rehearsal or regularization.
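
To make the idea of a probing-based forgetting measure concrete, the sketch below computes the average drop in a probe's score relative to the pre-trained baseline over a sequence of continually learned tasks. This is only an illustrative assumption about how such a measure could be built; the exact definitions of GD, SynF, and SemF are given in the paper body and are not reproduced here. The function names and the accuracy values are hypothetical.

```python
from statistics import mean


def forgetting_per_step(baseline, after_each_task):
    """Drop in probing score relative to the pre-trained baseline,
    measured after training on each task in the sequence."""
    return [baseline - score for score in after_each_task]


def average_forgetting(baseline, after_each_task):
    """Mean drop across the task sequence (positive means knowledge was lost)."""
    return mean(forgetting_per_step(baseline, after_each_task))


# Hypothetical probing accuracies, NOT results from the paper.
syntactic_baseline = 0.80              # probe on the initial pre-trained model
syntactic_after = [0.75, 0.70, 0.65]   # probe after each continual-learning task

print(average_forgetting(syntactic_baseline, syntactic_after))
```

Under this simplified view, a larger positive value means the probe recovers less of the knowledge it could read from the original pre-trained representations.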

Paper Structure (20 sections, 7 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: t-SNE visualization of the representations of natural language inference (NLI) data in our experiment. The representations obtained by fine-tuning on NLI are shown in orange and red; after continual learning they drift to the positions shown in green and blue.
  • Figure 2: Probing results on BERT through continual training on Order 1. "BERT" denotes the initial bert-base-uncased model; the other entries show probing performance after training on the corresponding task.
  • Figure 3: Probing results on BERT through continual training on Order 2, which is a reverse sequence of Order 1.
  • Figure 4: Evaluations of different continual learning strategies (FT refers to continual learning without any strategy).