Distral: Robust Multitask Reinforcement Learning

Yee Whye Teh, Victor Bapst, Wojciech Marian Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, Razvan Pascanu

TL;DR

Distral addresses the data inefficiency of multitask reinforcement learning by introducing a shared distilled policy that captures common behavior across tasks. Task policies are regularized toward this distilled policy through KL divergence and entropy terms, while the distilled policy is learned by distilling information from all tasks. The paper formalizes the framework, derives both soft Q-learning and policy gradient variants, and demonstrates that Distral yields faster learning, better final performance, and greater robustness than standard multitask A3C, both on a simple grid world and in complex 3D environments. This approach shifts regularization from parameter space to a semantically meaningful policy space, enabling more reliable transfer and stability across diverse tasks.
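The KL and entropy regularization described above can be illustrated with a minimal sketch. It assumes the per-step form in which each reward is reduced by a scaled log-ratio between the task policy and the distilled policy and increased by a scaled negative log-probability of the action taken. The function name `regularized_return` and the coefficient names `c_kl` and `c_ent` are illustrative choices made here, and the default values are arbitrary.

```python
import math


def regularized_return(rewards, logp_task, logp_distilled,
                       c_kl=1.0, c_ent=0.1, gamma=0.99):
    """Discounted return of one trajectory with KL and entropy terms.

    rewards[t]        : task reward at step t
    logp_task[t]      : log pi_i(a_t | s_t) under the task policy
    logp_distilled[t] : log pi_0(a_t | s_t) under the distilled policy
    c_kl, c_ent       : hypothetical names for the regularization weights
    """
    total = 0.0
    for t, (r, lp_i, lp_0) in enumerate(zip(rewards, logp_task, logp_distilled)):
        # Sampled KL term: penalizes actions the task policy favors
        # much more strongly than the distilled policy does.
        kl_penalty = c_kl * (lp_i - lp_0)
        # Sampled entropy term: rewards low-probability (exploratory) actions.
        entropy_bonus = -c_ent * lp_i
        total += gamma ** t * (r - kl_penalty + entropy_bonus)
    return total


if __name__ == "__main__":
    # Two-step trajectory where the task policy is more confident
    # than the distilled policy about the actions it took.
    value = regularized_return(
        rewards=[1.0, 0.0],
        logp_task=[math.log(0.8), math.log(0.6)],
        logp_distilled=[math.log(0.5), math.log(0.5)],
    )
    print(value)
```

When the task policy and the distilled policy agree on every action taken and `c_ent` is zero, the result reduces to the ordinary discounted return.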

Abstract

Most deep reinforcement learning algorithms are data inefficient in complex and rich environments, limiting their applicability to many scenarios. One direction for improving data efficiency is multitask learning with shared neural network parameters, where efficiency may be improved through transfer across related tasks. In practice, however, this is not usually observed, because gradients from different tasks can interfere negatively, making learning unstable and sometimes even less data efficient. Another issue is the different reward schemes between tasks, which can easily lead to one task dominating the learning of a shared model. We propose a new approach for joint training of multiple tasks, which we refer to as Distral (Distill & transfer learning). Instead of sharing parameters between the different workers, we propose to share a "distilled" policy that captures common behaviour across tasks. Each worker is trained to solve its own task while constrained to stay close to the shared policy, while the shared policy is trained by distillation to be the centroid of all task policies. Both aspects of the learning process are derived by optimizing a joint objective function. We show that our approach supports efficient transfer on complex 3D environments, outperforming several related methods. Moreover, the proposed learning process is more robust and more stable, attributes that are critical in deep reinforcement learning.
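The "centroid" idea can be made concrete in a simplified setting. The sketch below assumes a single state with a small discrete action set and tabular policies, which is not the neural network setting of the paper. In that special case, the distribution that minimizes the summed KL divergence from each task policy to the shared policy is the arithmetic mean of the task policies. The helper names `kl`, `centroid`, and `total_kl` are illustrative.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)


def centroid(policies):
    """Mean action distribution across task policies (single state).

    For a fixed state this minimizes sum_i KL(pi_i || pi_0) over pi_0,
    a tabular stand-in for training the shared policy by distillation.
    """
    n = len(policies)
    return [sum(column) / n for column in zip(*policies)]


def total_kl(policies, shared):
    """Summed divergence from every task policy to the shared policy."""
    return sum(kl(p, shared) for p in policies)


if __name__ == "__main__":
    # Two hypothetical task policies over the actions (left, right).
    task_policies = [[0.9, 0.1], [0.1, 0.9]]
    pi_0 = centroid(task_policies)
    print(pi_0)
    print(total_kl(task_policies, pi_0))
    print(total_kl(task_policies, [0.6, 0.4]))
```

Any other candidate for the shared policy, such as `[0.6, 0.4]` in the usage lines, gives a larger summed divergence than the mean.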

Paper Structure

This paper contains 15 sections, 10 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Illustration of the Distral framework.
  • Figure 2: Depiction of the different algorithms and baselines. On the left are two of the Distral algorithms and on the right are the three A3C baselines. Entropy is drawn in brackets as it is optional and only used for KL+ent 2col and KL+ent 1col.
  • Figure 3: Left: Learning curves on the two-room grid world. The Distral agent (blue) learns faster, converges towards better policies, and demonstrates more stable learning overall. Center: Example of tasks. Green is the goal position, which is uniformly sampled for each task. The starting position is uniformly sampled at the beginning of each episode. Right: Depiction of the learned distilled policy $\pi_0$ in the corridor only, conditioned on the previous action being left/right and no previous reward. Sizes of arrows depict probabilities of actions. Note that up/down actions have negligible probabilities. The model learns to preserve direction of travel in the corridor.
  • Figure 4: Panels A1, C1, D1 show task-specific policy performance (averaged across all the tasks) for the maze, navigation and laser-tag tasks, respectively. The $x$-axes are total numbers of training environment steps per task. Panel B1 shows the mean scores obtained with the distilled policies (A3C has no distilled policy, so it is represented by the performance of an untrained network). For each algorithm, results for the best set of hyperparameters (based on the area under curve) are reported. The bold line is the average over 4 runs, and the colored area is the average standard deviation over the tasks. Panels A2, B2, C2, D2 show the corresponding final performances for the 36 runs of each algorithm, ordered from best to worst (9 hyperparameter settings and 4 runs).
  • Figure 5: Scores on the 8 different tasks of the navigation suite. Top two rows show the results with the task specific policies, bottom two rows show the results with the distilled policy. For each algorithm, results for the best set of hyperparameters are reported, as obtained by maximizing the averaged (over tasks and runs) areas under curves. For each algorithm, the 4 thin curves correspond to the 4 runs. The average over these runs is shown in bold. The $x$-axis shows the total number of training environment steps for each task.
  • ...and 2 more figures