Learning to Continually Learn

Shawn Beaulieu, Lapo Frati, Thomas Miconi, Joel Lehman, Kenneth O. Stanley, Jeff Clune, Nick Cheney

TL;DR

A Neuromodulated Meta-Learning Algorithm (ANML) enables continual learning without catastrophic forgetting at scale: it produces state-of-the-art continual learning performance, sequentially learning as many as 600 classes (over 9,000 SGD updates).

Abstract

Continual lifelong learning requires an agent or model to learn many sequentially ordered tasks, building on previous knowledge without catastrophically forgetting it. Much work has gone towards preventing the default tendency of machine learning models to catastrophically forget, yet virtually all such work involves manually designed solutions to the problem. We instead advocate meta-learning a solution to catastrophic forgetting, allowing AI to learn to continually learn. Inspired by neuromodulatory processes in the brain, we propose A Neuromodulated Meta-Learning Algorithm (ANML). It differentiates through a sequential learning process to meta-learn an activation-gating function that enables context-dependent selective activation within a deep neural network. Specifically, a neuromodulatory (NM) neural network gates the forward pass of another (otherwise normal) neural network called the prediction learning network (PLN). The NM network thus also indirectly controls selective plasticity (i.e. the backward pass) of the PLN. ANML enables continual learning without catastrophic forgetting at scale: it produces state-of-the-art continual learning performance, sequentially learning as many as 600 classes (over 9,000 SGD updates).
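
To make the gating idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a single gating point, a one-layer NM network with a sigmoid output, and a two-layer PLN; the names `gated_forward`, `output_weight_grad`, `W_nm`, `W_feat`, `W_out` and all numeric values are hypothetical. It shows that multiplying the PLN's activations element-wise by the NM gate also scales the gradient reaching the downstream weights, which is the sense in which gating the forward pass indirectly controls plasticity.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gated_forward(x, W_nm, W_feat, W_out):
    # NM network (assumed: one linear layer + sigmoid) yields a gate in (0, 1).
    gate = [sigmoid(a) for a in matvec(W_nm, x)]
    # PLN feature layer (assumed: one linear layer + ReLU).
    feats = [max(0.0, a) for a in matvec(W_feat, x)]
    # Element-wise gating of the PLN's forward-pass activations.
    gated = [g * f for g, f in zip(gate, feats)]
    # PLN output layer operates on the gated activations.
    logits = matvec(W_out, gated)
    return gate, gated, logits

def output_weight_grad(gated, dlogits):
    # dL/dW_out[i][j] = dlogits[i] * gated[j]: a unit gated towards zero
    # contributes almost no update to the weights that read from it.
    return [[d * h for h in gated] for d in dlogits]

# Toy values (hypothetical): the gate keeps unit 0 open and closes unit 1.
x = [1.0, 2.0]
W_nm = [[10.0, 0.0], [-10.0, 0.0]]
W_feat = [[1.0, 0.0], [0.0, 1.0]]
W_out = [[1.0, 1.0], [1.0, -1.0]]

gate, gated, logits = gated_forward(x, W_nm, W_feat, W_out)
grad_W_out = output_weight_grad(gated, dlogits=[1.0, -1.0])
```

In this toy setting the weights reading from the closed unit receive a near-zero gradient, while those reading from the open unit are updated almost as if no gate were present. The same gate factor also multiplies the gradient flowing back into the feature layer, which the sketch omits for brevity.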

Paper Structure

This paper contains 13 sections, 8 figures, 2 algorithms.

Figures (8)

  • Figure 1: The architecture for A Neuromodulated Meta-Learning Algorithm (ANML). The prediction network (red) is a normal neural network updated in the inner loop via SGD (or similar). The neuromodulatory network (blue) produces an element-wise gating of the prediction network's forward-pass activations, enabling selective activation (i.e. conditional computation) and indirectly enabling selective plasticity by affecting the gradient updates of the prediction network. The initial weights (at the start of each inner loop) of both the neuromodulatory and prediction networks are meta-learned in the outer loop of optimization. The weights of the neuromodulatory network are not updated in the inner loop, but those of the prediction network are. The example image is from the Omniglot dataset, which is the domain for the experiments in this paper.
  • Figure 2: Meta-test training classification accuracy. The $x$-axis shows the number of sequential tasks/classes in the meta-test training trajectory. Accuracy is calculated with the final prediction network parameters after training sequentially on that full meta-test training trajectory, and is evaluated on all instances in that trajectory (i.e. the meta-test training set).
  • Figure 3: Meta-test testing classification accuracy. The $x$-axis shows the number of sequential tasks/classes in the meta-test training trajectory. Accuracy is calculated with the final prediction network parameters after training sequentially on that full meta-test training trajectory, and the evaluation is on held-out (i.e. test) instances of the meta-test classes. Thus, these meta-test test instances were not seen during meta-training or meta-test training. For all trajectory lengths tested, ANML significantly outperforms OML, the pretrained-and-transfer networks, and models trained from scratch.
  • Figure 5: The sparsity of activations before (top row) and after (bottom row) the neuromodulatory gating signal (middle row) has been applied, shown for three random inputs from the meta-test test set, and the mean across all images in the meta-test test set. Colorbars for subfigures are individually normalized to better show min (blue) and max (yellow) activations. Note that post-NM activations are sparse for each individual image, but near-uniformly distributed on average, revealing that NM helps to create sparse, orthogonal representations, and efficiently uses all of its compute resources instead of creating wasteful "dead neurons."
  • Figure 6: 2D t-SNE projections of the latent representations for 10 randomly selected meta-test training classes when the network saw them for the first time (before any labels are provided). Left: OML representations. Middle: PLN representations before being gated by the neuromodulatory network. Right: PLN representations after being gated by the neuromodulatory network. Qualitatively, the post-neuromodulatory activations form better-separated clusters. KNN classification accuracy quantitatively shows the improved accuracy that results from such separation (see text).
  • ...and 3 more figures
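
The Figure 5 caption contrasts per-image sparsity with near-uniform average usage of units. The sketch below illustrates one way such quantities could be measured; it is an assumption-laden illustration in plain Python, not the paper's analysis code. The helper names (`active_fraction`, `mean_activation_per_unit`, `dead_units`), the threshold `eps`, and the toy activation matrix `post_nm` are all hypothetical.

```python
def active_fraction(activations, eps=1e-6):
    # Fraction of units whose magnitude exceeds eps for a single input.
    return sum(1 for a in activations if abs(a) > eps) / len(activations)

def mean_activation_per_unit(batch):
    # Average activation of each unit across all inputs in the batch.
    n = len(batch)
    return [sum(col) / n for col in zip(*batch)]

def dead_units(batch, eps=1e-6):
    # Indices of units that never exceed eps on any input in the batch.
    return [j for j, col in enumerate(zip(*batch))
            if all(abs(a) <= eps for a in col)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy post-gating activations (hypothetical): 4 inputs x 4 units,
# each input activating a different single unit.
post_nm = [
    [0.9, 0.0, 0.0, 0.0],
    [0.0, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.0],
    [0.0, 0.0, 0.0, 0.9],
]

per_input_sparsity = [active_fraction(row) for row in post_nm]
unit_usage = mean_activation_per_unit(post_nm)
never_used = dead_units(post_nm)
overlap = dot(post_nm[0], post_nm[1])
```

In this toy matrix each input activates a quarter of the units, every unit is used by some input, and distinct inputs have zero overlap. That is the pattern the caption describes as sparse, orthogonal representations without dead neurons.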