Episodic Memory in Lifelong Language Learning

Cyprien de Masson d'Autume, Sebastian Ruder, Lingpeng Kong, Dani Yogatama

TL;DR

The paper tackles lifelong language learning without dataset identifiers, proposing an episodic-memory-augmented encoder–decoder that uses sparse experience replay and local adaptation to mitigate catastrophic forgetting. A frozen key network builds a key-value memory storing past <x_t, y_t> pairs, enabling selective replay and retrieval-based adaptation, with a simple random write strategy to limit memory size. Experiments on text classification and QA show that combining sparse replay with memory-guided local adaptation (MbPA++) yields strong performance, approaching a multitask upper bound and demonstrating positive transfer across datasets. The approach emphasizes scalable memory management and retrieval quality as crucial for building a general linguistic intelligence capable of learning across diverse data distributions in a single pass.

Abstract

We introduce a lifelong language learning setup where a model needs to learn from a stream of text examples without any dataset identifier. We propose an episodic memory model that performs sparse experience replay and local adaptation to mitigate catastrophic forgetting in this setup. Experiments on text classification and question answering demonstrate the complementary benefits of sparse experience replay and local adaptation to allow the model to continuously learn from new datasets. We also show that the space complexity of the episodic memory module can be reduced significantly (~50-90%) by randomly choosing which examples to store in memory with a minimal decrease in performance. We consider an episodic memory component as a crucial building block of general linguistic intelligence and see our model as a first step in that direction.
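The random write strategy mentioned in the abstract can be illustrated with a minimal sketch. It assumes each incoming example is stored independently with a fixed probability; the class name `EpisodicMemory`, the `write_prob` parameter, and the method names are hypothetical and are not taken from the paper.

```python
import random


class EpisodicMemory:
    """Key-value store with a random write rule (illustrative sketch only)."""

    def __init__(self, write_prob, seed=0):
        if not 0.0 <= write_prob <= 1.0:
            raise ValueError("write_prob must be in [0, 1]")
        self.write_prob = write_prob  # hypothetical knob: fraction of examples kept
        self.entries = []             # list of (key, (x, y)) tuples
        self._rng = random.Random(seed)

    def maybe_write(self, key, x, y):
        """Store the example with probability write_prob; return True if stored."""
        if self._rng.random() < self.write_prob:
            self.entries.append((key, (x, y)))
            return True
        return False

    def sample(self, n):
        """Uniformly sample up to n stored (x, y) pairs, e.g. for experience replay."""
        n = min(n, len(self.entries))
        return [value for _, value in self._rng.sample(self.entries, n)]
```

Under this assumption, `write_prob=0.5` keeps about half of the stream and `write_prob=0.1` about a tenth, which corresponds to the 50-90% range of memory reduction quoted above.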

Paper Structure

This paper contains 30 sections, 4 equations, 3 figures, 9 tables, 2 algorithms.

Figures (3)

  • Figure 1: An illustration of our model and how it interacts with the key-value memory module during training (left) and inference (right). During training, newly seen examples are used to update the base model and stored in the memory. At certain intervals, we sample examples from the memory and perform gradient updates on the base model (experience replay). During inference, we retrieve examples whose keys are similar to a test example under consideration to fine-tune the model (local adaptation). We use the fine-tuned model to make a prediction and then discard it, keeping the base model for other predictions.
  • Figure 2: Performance on test examples corresponding to the first dataset seen during training as training progresses.
  • Figure 3: $F_1$ scores for MbPA++ and MbPA as the number of local adaptation steps increases.
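
The training and inference procedure described in the Figure 1 caption can be sketched on a toy problem. This is not the paper's model: a linear regressor with squared loss stands in for the encoder–decoder, the raw input vector stands in for the output of the frozen key network, and retrieval uses squared Euclidean distance. All function names and the parameters `replay_every`, `replay_size`, `k`, `steps`, and `lr` are hypothetical.

```python
import random


def predict(w, x):
    """Linear prediction w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sgd_step(w, batch, lr):
    """One in-place gradient step on the mean squared error over batch."""
    grad = [0.0] * len(w)
    for x, y in batch:
        err = predict(w, x) - y
        for i, xi in enumerate(x):
            grad[i] += 2.0 * err * xi / len(batch)
    for i in range(len(w)):
        w[i] -= lr * grad[i]


def train(stream, dim, replay_every=4, replay_size=4, lr=0.05, seed=0):
    """Single pass over the stream with sparse experience replay."""
    rng = random.Random(seed)
    w = [0.0] * dim
    memory = []  # list of (key, (x, y)); the key is the raw input here
    for t, (x, y) in enumerate(stream, start=1):
        sgd_step(w, [(x, y)], lr)          # update on the new example
        memory.append((list(x), (x, y)))   # write it to memory
        if t % replay_every == 0:          # replay only at sparse intervals
            n = min(replay_size, len(memory))
            replay = [value for _, value in rng.sample(memory, n)]
            sgd_step(w, replay, lr)
    return w, memory


def predict_with_local_adaptation(w, memory, x, k=3, steps=5, lr=0.05):
    """Fine-tune a copy of w on the k nearest stored examples, predict, discard."""
    def dist(key):
        return sum((a - b) ** 2 for a, b in zip(key, x))

    nearest = sorted(memory, key=lambda entry: dist(entry[0]))[:k]
    neighbours = [value for _, value in nearest]
    w_local = list(w)  # copy, so the base model is left untouched
    for _ in range(steps):
        sgd_step(w_local, neighbours, lr)
    return predict(w_local, x)
```

A call such as `w, memory = train(stream, dim=2)` followed by `predict_with_local_adaptation(w, memory, x)` leaves `w` unchanged, mirroring the caption's point that the fine-tuned model is discarded after each prediction.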