
A Study of Plasticity Loss in On-Policy Deep Reinforcement Learning

Arthur Juliani, Jordan T. Ash

TL;DR

Plasticity loss is pervasive under domain shift in on-policy deep RL. A number of methods developed to resolve it in other settings fail, whereas a class of "regenerative" methods consistently mitigates plasticity loss in a variety of contexts.

Abstract

Continual learning with deep neural networks presents challenges distinct from both the fixed-dataset and convex continual learning regimes. One such challenge is plasticity loss, wherein a neural network trained in an online fashion displays a degraded ability to fit new tasks. This problem has been extensively studied in both supervised learning and off-policy reinforcement learning (RL), where a number of remedies have been proposed. Still, plasticity loss has received less attention in the on-policy deep RL setting. Here we perform an extensive set of experiments examining plasticity loss and a variety of mitigation methods in on-policy deep RL. We demonstrate that plasticity loss is pervasive under domain shift in this regime, and that a number of methods developed to resolve it in other settings fail, sometimes even performing worse than applying no intervention at all. In contrast, we find that a class of "regenerative" methods is able to consistently mitigate plasticity loss in a variety of contexts, including in gridworld tasks and more challenging environments like Montezuma's Revenge and ProcGen.

Paper Structure (21 sections, 11 figures, 10 tables)


Figures (11)

  • Figure 1: Examples of gridworld environment tasks. The agent (black triangle) begins each episode in the center of the environment. Blue jewels provide +1 reward, red jewels provide -1 reward, and dark grey walls prevent movement. Objects are placed randomly.
  • Figure 2: Performance in the gridworld environment under each of the three distribution shift conditions. The degradation between rounds is evidence of plasticity loss. (A): Epoch-level training performance for the Permute modification. Dotted vertical lines indicate the end of each round, before a new environmental distribution shift is applied. (B): Round-level training performance for the Permute modification. Data points correspond to normalized mean reward in the final 50 episodes of each round. Shaded regions correspond to standard error. (C, D): Round-level training performance for the Window and Expand conditions. Top row: Training performance. Bottom row: Test performance.
  • Figure 3: Top: Correlation plots of normalized mean reward in the gridworld environment compared against identified metrics. Each point is averaged over five replicates, and shows the final values produced after a ten-round experiment. Values for each measurement are normalized by its baseline level at initialization. Measurements that significantly correlate ($p < 0.05$) with normalized reward are bolded. First row: Training distribution performance. Second row: Test distribution performance. Bottom: Values of the three predictive metrics during the course of training for each intervention and window change condition as compared to training performance. Shaded regions correspond to standard error. Final values of plots like these correspond to a single point in the correlation plots above.
  • Figure 4: Performance of intervention methods compared to warm-start and reset-all baselines on the gridworld environment. Final-round mean reward is normalized by the performance at the end of the first round, and interval bars denote standard error. Top: Train performance. Bottom: Test performance.
  • Figure 5: Performance of intervention methods compared to warm-start and reset-all baselines on the CoinRun environment. Final-round mean reward is normalized by the performance at the end of the first round, and interval bars denote standard error. Top: Train performance. Bottom: Test performance.
  • ...and 6 more figures
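
The captions for Figures 4 and 5 describe final-round mean reward normalized by first-round performance, with standard error shown as interval bars. A minimal sketch of that calculation follows. It assumes normalization is a simple ratio of final-round to first-round mean reward, and that standard error is taken across replicates. The function names and the reward values are hypothetical and are not taken from the paper.

```python
import math
import statistics


def normalized_final_reward(round_rewards):
    """Ratio of final-round to first-round mean reward for one replicate.

    Assumption: normalization is a plain division; the paper may use a
    different baseline or offset.
    """
    first, last = round_rewards[0], round_rewards[-1]
    if first == 0:
        raise ValueError("first-round reward is zero; ratio is undefined")
    return last / first


def mean_and_standard_error(values):
    """Mean and standard error of the mean (sample std / sqrt(n))."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two replicates for a standard error")
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(n)


# Hypothetical per-round mean rewards for three replicates (illustrative only).
replicates = [
    [0.80, 0.72, 0.60],
    [1.00, 0.90, 0.80],
    [0.50, 0.45, 0.40],
]

scores = [normalized_final_reward(r) for r in replicates]
mean, sem = mean_and_standard_error(scores)
print(f"normalized final-round reward: {mean:.3f} +/- {sem:.3f}")
```

Under this convention, a value below 1 means the agent ends the final round performing worse than it did at the end of the first round, which is how the captions frame plasticity loss.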