Continual Deep Reinforcement Learning with Task-Agnostic Policy Distillation
Muhammad Burhan Hafez, Kerim Erekmen
TL;DR
The paper tackles continual reinforcement learning by addressing forgetting, forward transfer, and scalability in task-agnostic settings. It introduces Task-Agnostic Policy Distillation (TAPD), which interleaves a self-supervised, intrinsic-reward-driven task-agnostic exploration phase with the Progress & Compress framework, distilling exploratory knowledge into a knowledge base for faster downstream learning. Key contributions include (i) a task-agnostic phase that learns general exploratory policies, (ii) intrinsic rewards based on forward-model prediction errors to drive exploration without extrinsic rewards, and (iii) empirical evidence on Atari games showing improved sample efficiency and positive forward transfer compared to baselines like Online EWC, Progressive Nets, and Progress & Compress. The approach demonstrates scalable continual RL where task labels are unavailable, achieving robust transfer and competitive or superior performance with a more compact architecture.
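As a rough illustration of contribution (ii), the sketch below computes a curiosity-style intrinsic reward from the prediction error of a forward model. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `forward_model` and `intrinsic_reward`, the linear model, the squared-error form, and the `scale` factor are all hypothetical choices made for illustration.

```python
import numpy as np

N_ACTIONS = 3  # hypothetical size of a discrete action space


def forward_model(weights, features, action):
    """Hypothetical linear forward model: predicts next-state features
    from the current features and a one-hot encoded action."""
    one_hot = np.eye(N_ACTIONS)[action]
    return weights @ np.concatenate([features, one_hot])


def intrinsic_reward(predicted_next, actual_next, scale=0.5):
    """Scaled squared prediction error (an assumed form of the reward).
    Poorly predicted, i.e. novel, transitions yield a larger reward."""
    diff = np.asarray(predicted_next, dtype=float) - np.asarray(actual_next, dtype=float)
    return scale * float(np.sum(diff ** 2))


# Usage: an untrained (all-zero) model predicts zeros, so any transition
# to a non-zero feature vector produces a positive exploration bonus.
features = np.array([0.2, -0.1])
weights = np.zeros((2, 2 + N_ACTIONS))
predicted = forward_model(weights, features, action=1)
bonus = intrinsic_reward(predicted, actual_next=np.array([1.0, 0.0]))
```

In such a setup the bonus shrinks as the forward model improves on familiar transitions, which is what pushes the agent toward states it cannot yet predict.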
Abstract
Central to the development of universal learning systems is the ability to solve multiple tasks without retraining from scratch when new data arrives. This is crucial because each task requires significant training time. Continual learning is a multifaceted problem, and addressing it calls for several complementary methods. Its requirements include: (1) addressing catastrophic forgetting to retain previously learned tasks, (2) demonstrating positive forward transfer for faster learning, (3) ensuring scalability across numerous tasks, and (4) facilitating learning without requiring task labels, even in the absence of clear task boundaries. In this paper, we introduce the Task-Agnostic Policy Distillation (TAPD) framework. TAPD alleviates problems (1)-(4) by incorporating a task-agnostic phase in which an agent explores its environment without any external goal, maximizing only its intrinsic motivation. The agent thus acts in a self-supervised manner by systematically seeking novel states, and the knowledge gained during this phase is later distilled for further exploration. By utilizing this task-agnostic distilled knowledge, the agent can solve downstream tasks more efficiently, which improves sample efficiency. Our code is available at https://github.com/wabbajack1/TAPD.
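To make the distillation step more concrete, the following sketch shows one common way to distill a policy: minimizing the KL divergence between the action distribution of a teacher policy (here, the exploratory policy) and that of a student (here, the knowledge base). This is a hedged illustration only. The function names, the direction of the KL term, and the use of raw logits are assumptions, and the abstract does not specify the exact loss used in TAPD.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)


def distillation_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) over a batch of states.
    Inputs have shape (batch, n_actions). The KL direction is an
    assumption chosen for illustration."""
    p = softmax(np.asarray(teacher_logits, dtype=float))
    q = softmax(np.asarray(student_logits, dtype=float))
    eps = 1e-12  # guards against log(0)
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(np.mean(kl))


# Usage: the loss is near zero when the student matches the teacher and
# grows as their action preferences diverge.
teacher = np.array([[5.0, 0.0, 0.0]])
matched = distillation_loss(teacher, teacher)
mismatched = distillation_loss(teacher, np.array([[0.0, 0.0, 5.0]]))
```

In a full training loop, the gradient of this loss with respect to the student parameters would be used to update the knowledge base while the teacher stays fixed.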
