Continual Deep Reinforcement Learning with Task-Agnostic Policy Distillation

Muhammad Burhan Hafez, Kerim Erekmen

TL;DR

The paper tackles continual reinforcement learning by addressing forgetting, forward transfer, and scalability in task-agnostic settings. It introduces Task-Agnostic Policy Distillation (TAPD), which interleaves a self-supervised, intrinsic-reward-driven task-agnostic exploration phase with the Progress & Compress framework, distilling exploratory knowledge into a knowledge base for faster downstream learning. Key contributions include (i) a task-agnostic phase that learns general exploratory policies, (ii) intrinsic rewards based on forward-model prediction errors to drive exploration without extrinsic rewards, and (iii) empirical evidence on Atari games showing improved sample efficiency and positive forward transfer compared to baselines like Online EWC, Progressive Nets, and Progress & Compress. The approach demonstrates scalable continual RL where task labels are unavailable, achieving robust transfer and competitive or superior performance with a more compact architecture.
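
The intrinsic reward in contribution (ii) can be illustrated with a minimal sketch. The summary above states only that the reward is based on forward-model prediction errors; the linear model, the half squared-error form, the learning rate, and every name below (`LinearForwardModel`, `intrinsic_reward`) are assumptions made for illustration and are not taken from the authors' code.

```python
import numpy as np


def intrinsic_reward(forward_model, features, action_onehot, next_features):
    """Half squared error between predicted and observed next-state features.

    Assumption: the exact error form is not given in the summary; this is one
    common choice for curiosity-style rewards.
    """
    predicted = forward_model(features, action_onehot)
    return 0.5 * float(np.sum((predicted - next_features) ** 2))


class LinearForwardModel:
    """Hypothetical linear forward model: next_features ~ W @ [features; action]."""

    def __init__(self, feature_dim, num_actions, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.01, size=(feature_dim, feature_dim + num_actions))
        self.lr = lr

    def __call__(self, features, action_onehot):
        return self.W @ np.concatenate([features, action_onehot])

    def update(self, features, action_onehot, next_features):
        """One gradient step on the half squared prediction error."""
        x = np.concatenate([features, action_onehot])
        error = self.W @ x - next_features
        self.W -= self.lr * np.outer(error, x)


# A transition that is seen repeatedly becomes predictable, so its reward shrinks.
model = LinearForwardModel(feature_dim=4, num_actions=2)
state = np.array([1.0, 0.0, 0.0, 0.0])
action = np.array([1.0, 0.0])
next_state = np.array([0.0, 1.0, 0.0, 0.0])

reward_before = intrinsic_reward(model, state, action, next_state)
for _ in range(50):
    model.update(state, action, next_state)
reward_after = intrinsic_reward(model, state, action, next_state)
```

In this sketch the reward for a familiar transition decays toward zero as the forward model improves, which pushes an agent that maximizes it toward states it cannot yet predict.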

Abstract

Central to the development of universal learning systems is the ability to solve multiple tasks without retraining from scratch when new data arrives. This is crucial because each task requires significant training time. Addressing the problem of continual learning necessitates various methods due to the complexity of the problem space. This problem space includes: (1) addressing catastrophic forgetting to retain previously learned tasks, (2) demonstrating positive forward transfer for faster learning, (3) ensuring scalability across numerous tasks, and (4) facilitating learning without requiring task labels, even in the absence of clear task boundaries. In this paper, the Task-Agnostic Policy Distillation (TAPD) framework is introduced. This framework alleviates problems (1)-(4) by incorporating a task-agnostic phase, where an agent explores its environment without any external goal and maximizes only its intrinsic motivation. The knowledge gained during this phase is later distilled for further exploration. Therefore, the agent acts in a self-supervised manner by systematically seeking novel states. By utilizing task-agnostic distilled knowledge, the agent can solve downstream tasks more efficiently, leading to improved sample efficiency. Our code is available at the repository: https://github.com/wabbajack1/TAPD.

Paper Structure

This paper contains 26 sections, 7 equations, 5 figures, 6 tables, 2 algorithms.

Figures (5)

  • Figure 1: Illustration of the Meta-Environment, representing different game environments as tasks within the context of Atari 2600 games.
  • Figure 2: Overview of our Task-Agnostic Policy Distillation framework. (a) The task-agnostic phase is an abstraction of a process that alternates between maximizing intrinsic rewards and distillation, following the same alternating pattern as the Progress & Compress framework. (b) The task-agnostic phase is run first, before alternating between progress (P) and compress (C) phases. In the Atari domain, each task can be randomly selected from the Meta-Environment, thereby simulating one game environment. In the C phase, the policy most recently learned by the active column (green) is distilled into the knowledge base (KB) (blue) using the KL loss between the active column and the KB, while the KB's old values are protected using Elastic Weight Consolidation (EWC). In the P phase, features learned from previous tasks are reused via lateral connections when learning new tasks. $r^e_k$ and $r^i$ are the extrinsic reward of task $k$ and the task-independent intrinsic reward, respectively. $h$ is a hidden layer.
  • Figure 3: Performance evaluation in the task-agnostic phase. The environment is uniformly sampled, indicating no task boundaries. Runs are averaged over 8 random seeds, with 300,000 timesteps between distillation rounds in the task-agnostic phases. Averages are taken over 100 episodes.
  • Figure 4: Learning curves showing the rewards obtained in the progress phase for Task-Agnostic Policy Distillation (TAPD), Online EWC, Progressive Nets, and the reproduced Progress & Compress baseline. Reading from left to right, both performance and entropy are plotted. Tasks are learned sequentially in the following order: Pong, SpaceInvaders, BeamRider, DemonAttack, and AirRaid. TAPD utilizes the distilled knowledge from the task-agnostic phase. Results are averaged over 4 seeds and reflect the averages of scores taken across 100 episodes. Each task is revisited three times (gray vertical lines), with 2.5M training timesteps per visit in the progress phase.
  • Figure 5: Analysis of Algorithm Performance: Assessing Forward Transfer through Variance Across Visits and Tasks and Average Performance Across Tasks. Averaged Over 8 Seeds.
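
The compress (C) phase described in the Figure 2 caption combines a KL distillation term with an EWC penalty on the knowledge base. The following is a minimal sketch of such an objective, not the authors' implementation: the KL direction (active column to KB), the diagonal Fisher approximation, the `ewc_strength` weight, and all function names are assumptions made for illustration.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def kl_divergence(p, q, eps=1e-8):
    """Batch mean of KL(p || q) over rows of action probabilities."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))


def ewc_penalty(params, anchor_params, fisher_diag, strength):
    """Quadratic penalty that keeps params near anchor_params.

    Each parameter is weighted by its (assumed diagonal) Fisher information.
    """
    return 0.5 * strength * float(np.sum(fisher_diag * (params - anchor_params) ** 2))


def compress_loss(active_logits, kb_logits, kb_params, kb_anchor, fisher_diag,
                  ewc_strength=1.0):
    """Distillation term plus EWC term, evaluated for the knowledge base (KB)."""
    distill = kl_divergence(softmax(active_logits), softmax(kb_logits))
    return distill + ewc_penalty(kb_params, kb_anchor, fisher_diag, ewc_strength)


# Toy batch: 2 states, 3 actions; KB parameters flattened into one vector.
active_logits = np.array([[2.0, 0.5, -1.0], [0.0, 1.0, 0.0]])
kb_logits = np.zeros((2, 3))
kb_anchor = np.zeros(3)
kb_params = kb_anchor + 1.0
fisher_diag = np.ones(3)

loss = compress_loss(active_logits, kb_logits, kb_params, kb_anchor, fisher_diag,
                     ewc_strength=2.0)
```

The loss is zero only when the KB reproduces the active column's action distribution and its parameters have not moved from their anchored values; the Fisher weights decide which parameters are expensive to change.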