Addressing Loss of Plasticity and Catastrophic Forgetting in Continual Learning

Mohamed Elsayed; A. Rupam Mahmood

TL;DR

The paper tackles two problems in continual learning under streaming, non-stationary conditions: catastrophic forgetting and loss of plasticity. It introduces Utility-based Perturbed Gradient Descent (UPGD), which scales each gradient update by an estimated utility: more useful units receive smaller modifications, protecting them from forgetting, while less useful units receive larger, perturbed modifications that rejuvenate their plasticity. The utility is computed with a scalable second-order approximation. The method comes with a convergence analysis and is evaluated on streaming learning problems and on reinforcement learning with PPO, where it surpasses or is competitive with the compared methods. The results suggest UPGD enables more robust representation learning in on-device and non-stationary settings.

Abstract

Deep representation learning methods struggle with continual learning, suffering from both catastrophic forgetting of useful units and loss of plasticity, often due to rigid and unuseful units. While many methods address these two issues separately, only a few currently deal with both simultaneously. In this paper, we introduce Utility-based Perturbed Gradient Descent (UPGD) as a novel approach for the continual learning of representations. UPGD combines gradient updates with perturbations, where it applies smaller modifications to more useful units, protecting them from forgetting, and larger modifications to less useful units, rejuvenating their plasticity. We use a challenging streaming learning setup where continual learning problems have hundreds of non-stationarities and unknown task boundaries. We show that many existing methods suffer from at least one of the issues, predominantly manifested by their decreasing accuracy over tasks. On the other hand, UPGD continues to improve performance and surpasses or is competitive with all methods in all problems. Finally, in extended reinforcement learning experiments with PPO, we show that while Adam exhibits a performance drop after initial learning, UPGD avoids it by addressing both continual learning issues.
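The update described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the first-order utility estimate `-g * w`, the sigmoid scaling by the largest utility magnitude, the Gaussian perturbation, and all function names and default hyperparameters below are assumptions chosen for illustration. The abstract only states that more useful units receive smaller modifications and less useful units receive larger, perturbed ones.

```python
import math
import random


def first_order_utility(weights, grads):
    """Assumed utility estimate: -gradient * weight for each weight."""
    return [-g * w for w, g in zip(weights, grads)]


def scaled_utility(utilities):
    """Squash utilities into (0, 1): divide by the largest magnitude, then apply a sigmoid."""
    max_abs = max(abs(u) for u in utilities) or 1.0  # avoid division by zero
    return [1.0 / (1.0 + math.exp(-u / max_abs)) for u in utilities]


def upgd_style_step(weights, grads, utilities, lr=0.01, noise_std=0.001, rng=random):
    """One utility-gated, perturbed gradient step (illustrative sketch).

    Each weight moves by lr * (gradient + noise) * (1 - scaled utility),
    so high-utility weights change little and low-utility weights change more.
    """
    scaled = scaled_utility(utilities)
    new_weights = []
    for w, g, s in zip(weights, grads, scaled):
        noise = rng.gauss(0.0, noise_std)  # perturbation added to the gradient
        new_weights.append(w - lr * (g + noise) * (1.0 - s))
    return new_weights


if __name__ == "__main__":
    weights = [1.0, 1.0]
    grads = [1.0, 1.0]
    utilities = [10.0, -10.0]  # first weight is "useful", second is not
    print(upgd_style_step(weights, grads, utilities, noise_std=0.0))
```

With identical gradients, the weight with the higher utility moves less than the weight with the lower utility, which is the protective behavior the abstract describes.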

Paper Structure (40 sections, 2 theorems, 37 equations, 23 figures, 3 tables, 6 algorithms)

Challenges of Continual Learning
Related Works
Method
Scalable Approximation of the True Utility
Utility-based Perturbed Gradient Descent (UPGD)
Forgetting and Plasticity Evaluation Metrics
Experiments
Quality of the Approximated Utilities
UPGD Against Loss of Plasticity
UPGD Against Catastrophic Forgetting
UPGD Against Loss of Plasticity and Catastrophic Forgetting
UPGD against Policy Collapse
Conclusion
Convergence Analysis for UPGD and Non-protecting UPGD
Non-protecting Utility-based Perturbed Gradient Descent
...and 25 more sections

Key Result

Theorem 1

If the second-order off-diagonal terms in all layers of a neural network except the last one are zero and all higher-order derivatives are zero, the true utility of weight $ij$ at layer $l$ can be propagated using a recursive formulation.

Figures (23)

Figure 1: (a) Adam suffers from catastrophic forgetting and hence hardly improves performance. (b & c) Adam loses plasticity as newer and newer tasks are presented and later performs much worse than Adam with restarts. In contrast, our proposed method, UPGD, quickly learns and maintains plasticity throughout learning. See the appendix for experimental details.
Figure 2: Rank correlation between the true utility and approximated utility.
Figure 3: Performance of methods on the Input-permuted MNIST problem.
Figure 4: Each method's average plasticity against its average accuracy on Input-permuted MNIST.
Figure 5: Diagnostic statistics on Input-permuted MNIST. The percentage of zero activations, $\ell_0$-norm and $\ell_1$-norm of the gradients, and $\ell_1$-norm of the weights are shown. We stacked the elements from the network gradients or weights into vectors to compute each norm at every sample.
...and 18 more figures

Theorems & Definitions (4)

Theorem 1
proof
Theorem 2
proof
