Maintaining Plasticity in Deep Continual Learning

Shibhansh Dohare; J. Fernando Hernandez-Garcia; Parash Rahman; A. Rupam Mahmood; Richard S. Sutton

TL;DR

This work shows that deep neural networks suffer a pronounced loss of plasticity in continual learning: as a sequence of tasks progresses, they become worse at learning new tasks. It provides direct demonstrations on both ImageNet-based and MNIST-based continual tasks, analyzes underlying causes linked to properties conferred by initialization, and evaluates existing mitigation strategies. The authors propose Continual Backpropagation (CBP), which combines gradient descent with selective reinitialization of low-utility units, guided by a two-part utility measure, and show that it preserves plasticity across several continual-learning benchmarks and in a preliminary continual RL setting. The approach offers a path toward maintaining adaptability in non-stationary environments and motivates future work on principled utility design and broader applicability. Overall, CBP addresses a fundamental limitation of methods designed for train-once settings and offers a concrete way to maintain plasticity in continual learning.

Abstract

Modern deep-learning systems are specialized to problem settings in which training occurs once and then never again, as opposed to continual-learning settings in which training occurs continually. If deep-learning systems are applied in a continual-learning setting, then it is well known that they may fail to remember earlier examples. More fundamental, but less well known, is that they may also lose their ability to learn on new examples, a phenomenon called loss of plasticity. We provide direct demonstrations of loss of plasticity using the MNIST and ImageNet datasets repurposed for continual learning as sequences of tasks. In ImageNet, binary classification performance dropped from 89% accuracy on an early task down to 77%, about the level of a linear network, on the 2000th task. Loss of plasticity occurred with a wide range of deep network architectures, optimizers, and activation functions, as well as with batch normalization and dropout, but was substantially eased by L2-regularization, particularly when combined with weight perturbation. Further, we introduce a new algorithm, continual backpropagation, which slightly modifies conventional backpropagation to reinitialize a small fraction of less-used units after each example and appears to maintain plasticity indefinitely.
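
As a rough illustration of the mitigation mentioned in the abstract (L2-regularization combined with weight perturbation), the sketch below applies one gradient step with an L2 penalty and then adds small Gaussian noise to the weights. It is not the authors' implementation: the function name `sgd_step_l2_perturb`, its parameters, and all default values are assumptions chosen for illustration only.

```python
import numpy as np


def sgd_step_l2_perturb(w, grad, step_size=0.01, weight_decay=1e-3,
                        noise_std=1e-4, rng=None):
    """One SGD step with an L2 penalty, followed by a small random perturbation.

    All names and default values here are hypothetical, not taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    # The L2 penalty adds weight_decay * w to the gradient, which shrinks
    # every weight toward zero by a factor of (1 - step_size * weight_decay).
    w_new = w - step_size * (grad + weight_decay * w)
    # Perturbation: zero-mean Gaussian noise added to every weight.
    return w_new + rng.normal(0.0, noise_std, size=w.shape)
```

With a zero gradient and `noise_std=0.0`, the step reduces to pure shrinkage, which makes the effect of the L2 term easy to check in isolation.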

Paper Structure

This paper contains 12 sections, 4 equations, 13 figures, 3 tables, 3 algorithms.

Loss of Plasticity
Loss of Plasticity in ImageNet
Robust Loss of Plasticity in Permuted MNIST
Understanding Loss of Plasticity
Existing Deep-Learning Methods for Mitigating Loss of Plasticity
Continual Backpropagation: Stochastic Gradient Descent with Selective Reinitialization
Discussion
Methods
Loss of Plasticity With Different Activations in the Slowly Changing Regression Problem
Extension to a continual RL problem, Slippery Ant
Ablation Study for the utility measure
Continual PPO

Figures (13)

Figure 1: Loss of plasticity on a sequence of ImageNet binary classification tasks. The first plot shows performance over the first ten tasks, which sometimes improved initially before declining. The second plot shows performance over 2000 tasks, over which the loss of plasticity was extensive. The learning algorithm was backpropagation applied in the conventional deep-learning way.
Figure 2: a: Left: An MNIST image with the label '7'; Right: A corresponding permuted image. b: Loss of plasticity in Online Permuted MNIST is robust over step sizes, network sizes, and rates of change.
Figure 3: Evolution of various qualities of a deep network trained via backpropagation with different step sizes on Online Permuted MNIST for different parameter settings. Left: Over time, the percent of dead units in the network increases for all the networks trained with backpropagation. Center: The average magnitude of the weights increases over time for all the networks trained with backpropagation. Right: The effective rank of the representation of the networks trained with backpropagation decreases over time. The results in these three plots are the average over 30 runs. The shaded regions correspond to plus and minus one standard error. For some lines, the shaded region is thinner than the line width due to the standard error being small.
Figure 4: a: Online classification accuracy of various algorithms on Online Permuted MNIST. Only $L^2$-Regularization and Shrink and Perturb (S&P) have higher accuracy than backpropagation after learning 800 tasks, and S&P shows almost no drop in online classification accuracy over time. The results are the average over 30 independent runs. The shaded regions correspond to plus and minus one standard error. b: Evolution of various qualities of a deep network on Online Permuted MNIST for different deep-learning algorithms. Left: The average magnitude of the weights increases over time for all methods except $L^2$-Regularization and S&P, the only two methods with an explicit mechanism to stop the weights from becoming too large. Center: Over time, the percentage of dead units increases in all methods, although S&P keeps the number of dead units from growing too much. Surprisingly, Online Norm starts having dead units after around 200 tasks, even though it was explicitly designed to avoid the problem of dead units. Right: The effective rank of the representation drops over time for all methods; with Dropout and S&P, the drop stops after around 200 tasks. The results in these plots are the average over 30 runs. The shaded regions correspond to plus and minus one standard error. For some lines, the shaded region is thinner than the line width because the standard error was small.
Figure 5: A unit in a network. The utility of a unit at time $t$ is the product of its contribution utility and its adaptation utility. The adaptation utility is the inverse of the sum of the magnitudes of the unit's incoming weights. The contribution utility is the product of the sum of the magnitudes of the unit's outgoing weights and the magnitude of the difference between its activation ($h_{l,i}$) and the activation's average ($\hat{f}_{l,i}$), where $\hat{f}_{l,i}$ is a running average of $h_{l,i}$.
...and 8 more figures
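
The utility described in the Figure 5 caption can be sketched for a single hidden layer as follows. This is one reading of the caption, not the authors' code: the function names, the array layout, the small constant `eps` guarding against division by zero, and the `decay` value of the running average are all assumptions made for illustration.

```python
import numpy as np


def unit_utility(h, f_hat, w_in, w_out, eps=1e-12):
    """Per-unit utility for one hidden layer, following the Figure 5 caption.

    h:     (n_units,) current activations h_{l,i}
    f_hat: (n_units,) running averages of the activations
    w_in:  (n_prev, n_units) incoming weights (assumed layout)
    w_out: (n_units, n_next) outgoing weights (assumed layout)
    eps:   hypothetical constant to avoid dividing by zero
    """
    # Contribution utility: |h - f_hat| times the summed outgoing magnitudes.
    contribution = np.abs(h - f_hat) * np.abs(w_out).sum(axis=1)
    # Adaptation utility: inverse of the summed incoming magnitudes.
    adaptation = 1.0 / (np.abs(w_in).sum(axis=0) + eps)
    return contribution * adaptation


def update_running_average(f_hat, h, decay=0.99):
    """Exponential running average of the activations (decay is hypothetical)."""
    return decay * f_hat + (1.0 - decay) * h


# Tiny example with two hidden units.
h = np.array([1.0, 0.0])
f_hat = np.zeros(2)
w_in = np.array([[1.0, 2.0], [1.0, 2.0]])
w_out = np.array([[1.0, 1.0], [3.0, 3.0]])
utility = unit_utility(h, f_hat, w_in, w_out)
lowest = int(np.argmin(utility))  # index of the lowest-utility unit
```

In this example the second unit has zero utility because its activation equals its running average, so under a selective-reinitialization scheme it would be the first candidate for reinitialization.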
