Metalearning Continual Learning Algorithms

Kazuki Irie; Róbert Csordás; Jürgen Schmidhuber

TL;DR

The paper introduces Automated Continual Learning (ACL), a framework that trains self-referential neural networks, built on self-referential weight matrices (SRWMs), to metalearn their own in-context continual-learning algorithms. ACL formulates continual learning (CL) as long-span sequence processing and encodes the CL desiderata, good performance on both old and new tasks, directly into the metalearning objective, so that CL algorithms are discovered automatically and without replay memory. Empirically, ACL mitigates in-context catastrophic forgetting and outperforms both hand-crafted learning algorithms and popular meta-continual learning methods on Split-MNIST in the replay-free setting, with additional experiments on diverse tasks drawn from multiple standard image classification datasets. The paper also discusses current limitations of in-context CL, including a comparison with state-of-the-art CL methods that leverage pre-trained models.

Abstract

General-purpose learning systems should improve themselves in an open-ended fashion in ever-changing environments. Conventional learning algorithms for neural networks, however, suffer from catastrophic forgetting (CF), i.e., previously acquired skills are forgotten when a new task is learned. Instead of hand-crafting new algorithms for avoiding CF, we propose Automated Continual Learning (ACL) to train self-referential neural networks to metalearn their own in-context continual (meta)learning algorithms. ACL encodes continual learning (CL) desiderata -- good performance on both old and new tasks -- into its metalearning objectives. Our experiments demonstrate that ACL effectively resolves "in-context catastrophic forgetting," a problem that naive in-context learning algorithms suffer from; ACL-learned algorithms outperform both hand-crafted learning algorithms and popular meta-continual learning methods on the Split-MNIST benchmark in the replay-free setting, and ACL enables continual learning of diverse tasks consisting of multiple standard image classification datasets. We also discuss the current limitations of in-context CL by comparing ACL with state-of-the-art CL methods that leverage pre-trained models. Overall, we bring several novel perspectives into the long-standing problem of CL.

Paper Structure (34 sections, 4 equations, 5 figures, 9 tables)

Introduction
Background
Continual Learning
Metalearning via Sequence Learning a.k.a. In-Context Learning
Self-Referential Weight Matrices and Recursive Self-Transformers
General description.
Method
Experiments
Two-Task Setting: Comprehensible Study
Analysis: Emergence of In-Context Catastrophic Forgetting
General Evaluation
Discussion
Other Limitations.
Conclusion
Experimental Details
...and 19 more sections

Figures (5)

Figure 1: An illustration of sequence processing in Automated Continual Learning (ACL) using a self-referential weight matrix. The model processes a sequence of task demonstrations (i.e., x/y or input/output pairs corresponding to the task, e.g., training images and their labels for image classification tasks) and updates its own weight matrix (whose initial state is denoted by ${\bm{W}}_0$) as a function of the demo sequence. We denote by ${\bm{W}}_{\mathcal{A}}$, the weight matrix obtained after observing the sequence of Task A demonstrations (blue), and by ${\bm{W}}_{\mathcal{A}, \mathcal{B}}$, the matrix obtained after observing examples of Task A then Task B sequentially (green). This scheme is the same during meta-training and meta-testing. The weight matrices obtained at the task boundaries are used for evaluation: ${\bm{W}}_{\mathcal{A}}$ and a query of Task A (e.g., a test image in the image classification case) are used to predict the target corresponding to the query (e.g., the label corresponding to the test image); and ${\bm{W}}_{\mathcal{A}, \mathcal{B}}$ is used to make a prediction on a query for Task A (backward transfer) and for Task B (forward transfer). During meta-training, the model parameters (${\bm{W}}_0$ in this example) are modified to optimize all such predictions using (a memory-efficient implementation of) backpropagation through time.
Figure 2: Meta-training loss terms reported separately for each dataset (A is Omniglot, B is Mini-ImageNet) and each position in the CL sequence (1 or 2) in the two-task case, yielding 6 curves. Here the ACL backward transfer terms ("ACL bwd" in the legend) are not minimized, corresponding to the ACL/No case in Tables \ref{tab:two_task_acl} and \ref{tab:two_task_acl_extra}. (a) and (b) represent two typical cases for different random seeds. In both cases, the backward transfer losses diverge (purple and brown curves) when the model "metalearns a dataset," i.e., when the model becomes capable of learning tasks sampled from the corresponding dataset (other colors), causing in-context catastrophic forgetting. Note that blue/orange and green/red curve pairs almost overlap, indicating that when the model metalearns a dataset, tasks sampled from it can be in-context learned regardless of the position.
Figure 3: Visualization of weights during the presentation of Task 1 examples.
Figure 4: Visualization of weights during the presentation of Task 2 examples.
Figure 5: Visualization of weights during the presentation of Task 3 examples.
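
The evaluation scheme described in the Figure 1 caption can be illustrated with a small sketch. It is not the authors' implementation: the update function below is a plain delta-rule stand-in for the self-referential weight update (an assumption made for illustration), and all function names and the toy data are hypothetical. The sketch only shows which weight matrix is paired with which query to form the three loss terms in the two-task case; in ACL, per the caption, the sum of such terms is minimized with respect to the initial weights using backpropagation through time, which is not reproduced here.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(W, x):
    # Class probabilities computed from the current weight matrix.
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])

def update(W, x, y, lr=0.5):
    # ASSUMPTION: a simple delta-rule step standing in for the
    # self-referential weight update; it is not the SRWM rule.
    p = predict(W, x)
    return [[w + lr * ((1.0 if c == y else 0.0) - p[c]) * xi
             for w, xi in zip(row, x)]
            for c, row in enumerate(W)]

def observe(W, demos):
    # Process a sequence of (input, label) demonstrations in order.
    for x, y in demos:
        W = update(W, x, y)
    return W

def nll(W, query):
    # Negative log-likelihood of the query label under weights W.
    x, y = query
    return -math.log(predict(W, x)[y])

def acl_two_task_terms(W0, demos_a, demos_b, query_a, query_b):
    W_a = observe(W0, demos_a)    # weights at the Task A boundary
    W_ab = observe(W_a, demos_b)  # weights after Task A, then Task B
    return {
        "task_a_with_W_a": nll(W_a, query_a),
        "task_b_with_W_ab": nll(W_ab, query_b),  # forward transfer term
        "task_a_with_W_ab": nll(W_ab, query_a),  # backward transfer term
    }

# Hypothetical toy data: two 2-class tasks on disjoint input dimensions.
W0 = [[0.0] * 4 for _ in range(2)]
demos_a = [([1, 0, 0, 0], 0), ([0, 1, 0, 0], 1)]
demos_b = [([0, 0, 1, 0], 1), ([0, 0, 0, 1], 0)]
terms = acl_two_task_terms(W0, demos_a, demos_b,
                           query_a=([1, 0, 0, 0], 0),
                           query_b=([0, 0, 1, 0], 1))
total = sum(terms.values())
```

In this toy setup the two tasks use disjoint input dimensions, so learning Task B leaves the Task A prediction unchanged. When tasks share inputs, the backward transfer term is the one that penalizes forgetting, which is why leaving it out of the objective corresponds to the ACL/No case discussed under Figure 2.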
