Policy Consolidation for Continual Reinforcement Learning

Christos Kaplanis, Murray Shanahan, Claudia Clopath

TL;DR

The paper tackles catastrophic forgetting in continual reinforcement learning by proposing Policy Consolidation (PC), a framework that enforces memory of the agent's policy across multiple timescales via a cascade of hidden policies and KL-based regularization. By integrating this with PPO-style objectives, PC extends learning stability beyond single tasks and discrete task switches, and it is evaluated in single-agent alternating-task and multi-agent self-play settings. Empirical results show that PC improves continual learning performance and stability relative to PPO baselines, and the analysis examines the effects of cascade depth, task-switch schedule, and hidden-policy behavior. The work advances boundary-agnostic continual RL and points to prioritized consolidation and trajectory-based distillation as future directions for strengthening behavioral memory in non-stationary environments.

Abstract

We propose a method for tackling catastrophic forgetting in deep reinforcement learning that is agnostic to the timescale of changes in the distribution of experiences, does not require knowledge of task boundaries, and can adapt in continuously changing environments. In our policy consolidation model, the policy network interacts with a cascade of hidden networks that simultaneously remember the agent's policy at a range of timescales and regularise the current policy by its own history, thereby improving its ability to learn without forgetting. We find that the model improves continual learning relative to baselines on a number of continuous control tasks in single-task, alternating two-task, and multi-agent competitive self-play settings.
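
To make the idea of regularising a cascade of policies by their own history more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes diagonal-Gaussian policies represented only by their means and standard deviations, and it assumes that the penalty tying the policy at depth k to its previous version grows geometrically with depth. The names `kl_diag_gauss`, `self_consolidation_penalty`, `beta`, and `omega` are hypothetical, and the relation of `omega` here to the paper's $\omega$ is an assumption.

```python
import numpy as np


def kl_diag_gauss(mu_p, std_p, mu_q, std_q):
    """D_KL(p || q) for diagonal Gaussians, summed over action dimensions."""
    mu_p, std_p = np.asarray(mu_p, float), np.asarray(std_p, float)
    mu_q, std_q = np.asarray(mu_q, float), np.asarray(std_q, float)
    var_p, var_q = std_p ** 2, std_q ** 2
    per_dim = np.log(std_q / std_p) + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q) - 0.5
    return float(np.sum(per_dim))


def self_consolidation_penalty(new, old, beta=1.0, omega=2.0):
    """Weighted sum of D_KL(old_k || new_k) over cascade depths k.

    `new` and `old` are lists of (mean, std) pairs, one pair per policy in
    the cascade. The weight beta * omega**k is an assumed schedule: deeper
    policies are penalised more for moving, so they change more slowly.
    """
    total = 0.0
    for k, ((mu_n, std_n), (mu_o, std_o)) in enumerate(zip(new, old)):
        total += beta * omega ** k * kl_diag_gauss(mu_o, std_o, mu_n, std_n)
    return total


# Three-policy cascade over a one-dimensional action space.
old = [(np.zeros(1), np.ones(1)) for _ in range(3)]
shallow_moved = [(np.ones(1), np.ones(1))] + old[1:]
deep_moved = old[:2] + [(np.ones(1), np.ones(1))]

print(self_consolidation_penalty(old, old))            # no change, no penalty
print(self_consolidation_penalty(shallow_moved, old))  # same shift at depth 0
print(self_consolidation_penalty(deep_moved, old))     # same shift at depth 2
```

Under this assumed schedule, the same shift in the mean costs more at greater depth, which is one way for deeper policies to retain behaviour over longer timescales. The KL divergence is not symmetric, so the choice between $D_{\mathrm{KL}}\left(\pi_{k_{old}} || \pi_k\right)$ and $D_{\mathrm{KL}}\left(\pi_k || \pi_{k_{old}}\right)$ can change the penalty; the sketch uses the former.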

Paper Structure

This paper contains 28 sections, 9 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: (a) Depiction of synaptic consolidation model (adapted from benna2016computational) (b) Depiction of policy consolidation model. The arrows linking the $\pi_k$s to the $\pi_k^{old}$s represent KL constraints between them, with thicker arrows implying larger constraints, enforcing the policies to be closer together.
  • Figure 2: Reward over time for (a) alternating task and (b) single task runs; comparison of PC agent with fixed KL (with different $\beta$s), clipped (with different clip coefficients) and adaptive KL agents (omitted for some runs since return went very negative). Means and s.d. error bars over 3 runs per setting.
  • Figure 3: Moving averages of mean scores over time in RoboSumo environment of (a) the final version of each model against its past self at different stages of its history, and (b) the PC agents against the baselines at equivalent points in history. Mean scores calculated over 30 runs using 1 for a win, 0.5 for a draw and 0 for a loss. Error bars in (b) are s.d. across three PC runs, which are shown individually in (a).
  • Figure 4: (a) Reward over time of the policies of the networks at different cascade depths on HumanoidSmallLeg-v0, having been trained alternately on HumanoidSmallLeg-v0 and HumanoidBigLeg-v0. (b) Reward over time on alternating Humanoid tasks for different combinations of cascade length and $\omega$.
  • Figure S1: Reward over time using the (a) $D_{\mathrm{KL}}\left(\pi_{k_{old}} || \pi_k\right)$ and (b) $D_{\mathrm{KL}}\left(\pi_k || \pi_{k_{old}}\right)$ constraints.
  • ...and 1 more figure