Experience Replay Addresses Loss of Plasticity in Continual Learning

Jiuqi Wang, Rohan Chandra, Shangtong Zhang

TL;DR

This work investigates the loss of plasticity in continual learning and tests whether experience replay can mitigate it. By pairing a replay buffer with a Transformer, the authors show that plasticity loss can be avoided across regression, classification, and policy evaluation tasks without changing backpropagation or the activation functions and without adding regularization. They observe that Transformers with memory either maintain or improve performance over many tasks, while MLPs decline; RNNs and ERMLPs generally fail to leverage the memory. The authors hypothesize that in-context learning in Transformers underlies this effect, and discuss limitations and future directions for scaling and theory.

Abstract

Loss of plasticity is one of the main challenges in continual learning with deep neural networks, where neural networks trained via backpropagation gradually lose their ability to adapt to new tasks and perform significantly worse than their freshly initialized counterparts. The main contribution of this paper is to propose a new hypothesis that experience replay addresses the loss of plasticity in continual learning. Here, experience replay is a form of memory. We provide supporting evidence for this hypothesis. In particular, we demonstrate in multiple different tasks, including regression, classification, and policy evaluation, that by simply adding an experience replay and processing the data in the experience replay with Transformers, the loss of plasticity disappears. Notably, we do not alter any standard components of deep learning. For example, we do not change backpropagation. We do not modify the activation functions. And we do not use any regularization. We conjecture that experience replay and Transformers can address the loss of plasticity because of the in-context learning phenomenon.
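
To make the "experience replay as memory" idea concrete, the sketch below shows one way a replay buffer could feed a sequence model: past (input, target) pairs are stored, a subset is sampled as context, and the current query is appended at the end. This is a minimal illustration, not the authors' implementation. The names `ReplayBuffer`, `sample_context`, and `build_prompt`, the FIFO eviction policy, and the uniform sampling are all assumptions made here for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO memory of (input, target) pairs (hypothetical helper)."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item once full.
        self.storage = deque(maxlen=capacity)

    def add(self, x, y):
        self.storage.append((x, y))

    def sample_context(self, k, rng):
        """Return up to k stored pairs, sampled uniformly without replacement."""
        k = min(k, len(self.storage))
        return rng.sample(list(self.storage), k)


def build_prompt(buffer, query_x, k, rng):
    """Assemble the sequence a Transformer-style model would read:
    k remembered (x, y) pairs followed by the query, whose target is unknown."""
    context = buffer.sample_context(k, rng)
    return context + [(query_x, None)]


if __name__ == "__main__":
    rng = random.Random(0)
    buf = ReplayBuffer(capacity=100)
    for i in range(10):
        buf.add(float(i), 2.0 * i)  # toy regression data, assumed for the demo
    print(build_prompt(buf, query_x=3.5, k=4, rng=rng))
```

In a setup like this, the model would be trained with ordinary backpropagation to predict the missing target of the final element, so any adaptation to a new task can come from the context rather than from changes to the training procedure.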

Paper Structure

This paper contains 17 sections, 2 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: Mean square error for Slowly-Changing Regression. The losses are averaged in bins of 50,000. The runs are averaged over 20 seeds, and the shaded area displays the standard error.
  • Figure 2: Test accuracy for permuted MNIST. The runs are averaged over 20 seeds, and the shaded area displays the standard error.
  • Figure 3: Mean squared value error (MSVE) for policy evaluation with Boyan's chain. The MSVEs are averaged in bins of 10,000. The runs are averaged over 20 seeds, and the shaded area indicates the standard error.
  • Figure 4: Boyan's chain example. Arrows indicate nonzero transition probabilities.
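
For readers unfamiliar with the permuted MNIST benchmark in Figure 2, the sketch below illustrates how such a task sequence is commonly built: each task fixes one random permutation of the pixel positions and applies it to every image. This reflects the standard benchmark construction; whether the paper follows exactly this recipe is an assumption, and the function names `make_permutation` and `apply_permutation` are hypothetical.

```python
import random


def make_permutation(num_pixels, rng):
    """Draw one fixed pixel permutation that defines a single task."""
    perm = list(range(num_pixels))
    rng.shuffle(perm)
    return perm


def apply_permutation(image, perm):
    """Reorder a flattened image so output position j holds input pixel perm[j]."""
    return [image[i] for i in perm]


if __name__ == "__main__":
    rng = random.Random(0)
    num_pixels = 28 * 28  # a flattened MNIST image has 784 pixels
    # One permutation per task; a new task means a new permutation.
    tasks = [make_permutation(num_pixels, rng) for _ in range(3)]
    fake_image = [rng.random() for _ in range(num_pixels)]  # stand-in for a digit
    permuted = [apply_permutation(fake_image, perm) for perm in tasks]
    print(len(permuted), len(permuted[0]))
```

Because the permutation only shuffles pixel positions, each task has the same difficulty in principle, which is why a drop in accuracy on later tasks is read as a loss of plasticity rather than as harder data.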