DisCoRL: Continual Reinforcement Learning via Policy Distillation

René Traoré, Hugo Caselles-Dupré, Timothée Lesort, Te Sun, Guanghang Cai, Natalia Díaz-Rodríguez, David Filliat

TL;DR

DisCoRL tackles continual reinforcement learning by combining state representation learning with policy distillation to sequentially acquire policies and consolidate them into a single robust policy without task labels. The method learns task-specific SRL encoders, trains policies in the learned representation space, and distills them into a memory-efficient, unified policy using soft labels. It demonstrates near-teacher performance across three sequential tasks in simulation and transfers effectively to a real robot, addressing sim-to-real gaps via domain randomization and robust SRL. The work offers a practical, scalable approach to continual RL in robotics and identifies avenues for refining SRL updates and memory efficiency.

Abstract

In multi-task reinforcement learning there are two main challenges: at training time, the ability to learn different policies with a single model; at test time, inferring which of those policies to apply without an external signal. In the case of continual reinforcement learning a third challenge arises: learning tasks sequentially without forgetting the previous ones. In this paper, we tackle these challenges by proposing DisCoRL, an approach combining state representation learning and policy distillation. We experiment on a sequence of three simulated 2D navigation tasks with a three-wheel omni-directional robot. Moreover, we test our approach's robustness by transferring the final policy into a real-life setting. The policy can solve all tasks and automatically infer which one to run.

Paper Structure

This paper contains 23 sections, 1 equation, 9 figures, 1 table.

Figures (9)

  • Figure 1: Images of the three sequentially experienced tasks, in simulation (top) and in real life (bottom). Learning is performed in simulation; the real-life setting is used only at test time.
  • Figure 2: Overview of our full pipeline for Continual Reinforcement Learning. White cylinders are for datasets, gray squares for environments, and white squares for learning algorithms, whose name corresponds to the model trained. Each task $i$ is learned sequentially and independently by first generating a dataset $D_{R,i}$ with a random policy to train a state representation with an encoder $E_i$ with an SRL method (1), then we use $E_i$ and the environment to learn a policy $\pi_i$ in the state space (2). Once trained, $\pi_i$ is used to create a distillation dataset $D_{\pi_i}$ that acts as a memory of the learned behaviour. All policies are finally compressed into a single policy $\pi_{d:{1,..,i}}$ by merging the current dataset $D_{\pi_i}$ with datasets from previous tasks $D_{\pi_1} \cup ... \cup D_{\pi_{i-1}}$ and using distillation (3).
  • Figure 3:
  • Figure 4: Efficiency (rewards normalized w.r.t. the best teacher performance) of policies distilled on 8 seeds using various data generation strategies for each task separately. Each evaluated policy is distilled on 15k tuples of sampled observations and action probabilities, for 4 epochs (see the stopping criteria in the continual learning section and Appendix B).
  • Figure 5: Main result: distillation in a continual learning setting of three teacher policies into a single student policy. The resulting policy is able to perform all three tasks both in simulation and in the real world, while minimizing forgetting.
  • ...and 4 more figures
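
The distillation step (3) described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a KL-divergence loss between teacher action probabilities (the soft labels) and student action probabilities, which is a common choice for policy distillation. The function names (`softmax`, `kl_divergence`, `merge_distillation_datasets`, `distillation_loss`), the `temperature` parameter, and the toy data are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw action scores into a probability distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) over a discrete action distribution."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(teacher_probs, student_probs)
    )


def merge_distillation_datasets(*datasets):
    """Union of per-task datasets D_pi_1, ..., D_pi_i.

    Each dataset is a list of (observation, teacher_probs) tuples.
    No task label is stored alongside the samples.
    """
    merged = []
    for dataset in datasets:
        merged.extend(dataset)
    return merged


def distillation_loss(dataset, student_logits_fn):
    """Mean KL between the teacher soft labels and the student outputs."""
    total = 0.0
    for observation, teacher_probs in dataset:
        student_probs = softmax(student_logits_fn(observation))
        total += kl_divergence(teacher_probs, student_probs)
    return total / len(dataset)


if __name__ == "__main__":
    # Toy data (assumed): one sample per task, two discrete actions.
    d_pi_1 = [((0.0,), [0.9, 0.1])]
    d_pi_2 = [((1.0,), [0.2, 0.8])]
    merged = merge_distillation_datasets(d_pi_1, d_pi_2)

    # Hypothetical untrained student that outputs equal scores for both actions.
    def untrained_student(observation):
        return [0.0, 0.0]

    print("mean distillation loss:", distillation_loss(merged, untrained_student))
```

In a real training loop the student would be a neural network whose parameters are updated to minimize this loss over the merged dataset. Because each sample carries only an observation and the teacher's action probabilities, the student has to infer the task from the observation itself.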