Multi-task Deep Reinforcement Learning with PopArt

Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, Hado van Hasselt

TL;DR

This work tackles multi-task reinforcement learning by addressing the instability caused by varying reward scales across tasks. It introduces PopArt normalization, which yields scale-invariant actor-critic updates, and extends it to a multi-task setting with per-task value heads and a single shared policy. Empirical results on Atari-57 and DeepMind Lab (DmLab-30) show state-of-the-art performance, including a single trained agent surpassing median human performance on the Atari-57 suite, with data efficiency improved via Pixel Control. The approach is also robust to reward sparsity and scales well within the distributed IMPALA framework, offering practical benefits for training versatile, multi-task RL agents.

Abstract

The reinforcement learning community has made great strides in designing algorithms capable of exceeding human performance on specific tasks. These algorithms are mostly trained one task at a time, with each new task requiring a brand new agent instance to be trained. This means the learning algorithm is general, but each solution is not; each agent can only solve the one task it was trained on. In this work, we study the problem of learning to master not one but multiple sequential-decision tasks at once. A general issue in multi-task learning is that a balance must be found between the needs of multiple tasks competing for the limited resources of a single learning system. Many learning algorithms can get distracted by certain tasks in the set of tasks to solve. Such tasks appear more salient to the learning process, for instance because of the density or magnitude of the in-task rewards. This causes the algorithm to focus on those salient tasks at the expense of generality. We propose to automatically adapt the contribution of each task to the agent's updates, so that all tasks have a similar impact on the learning dynamics. This resulted in state-of-the-art performance on learning to play all games in a set of 57 diverse Atari games. Excitingly, our method learned a single trained policy, with a single set of weights, that exceeds median human performance. To our knowledge, this was the first time a single agent surpassed human-level performance on this multi-task domain. The same approach also demonstrated state-of-the-art performance on a set of 30 tasks in the 3D reinforcement learning platform DeepMind Lab.
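
To make the idea of adapting each task's contribution more concrete, here is a minimal sketch of PopArt-style statistics with one value head per task. It is an illustration under stated assumptions, not the authors' implementation: the class name `PopArtHead`, the linear head, the step size `beta`, and the variance floor are all hypothetical choices made for this example. The sketch tracks a running mean and second moment of the value targets per task, then rescales that task's head so its unnormalised predictions are unchanged when the statistics move.

```python
import math


class PopArtHead:
    """Per-task running statistics plus a linear value head (illustrative only)."""

    def __init__(self, num_tasks, num_features, beta=3e-4):
        self.beta = beta                      # step size for the statistics (assumed value)
        self.mu = [0.0] * num_tasks           # running mean of targets, per task
        self.nu = [1.0] * num_tasks           # running second moment of targets, per task
        self.w = [[0.0] * num_features for _ in range(num_tasks)]
        self.b = [0.0] * num_tasks

    def sigma(self, task):
        # Standard deviation from the two moments; the floor is an assumption
        # added here to avoid dividing by zero.
        return math.sqrt(max(self.nu[task] - self.mu[task] ** 2, 1e-8))

    def normalized_value(self, task, features):
        # Scale-invariant prediction that the learner would regress on.
        return sum(wi * xi for wi, xi in zip(self.w[task], features)) + self.b[task]

    def value(self, task, features):
        # Unnormalised prediction in the task's own reward scale.
        return self.sigma(task) * self.normalized_value(task, features) + self.mu[task]

    def update_stats(self, task, target):
        old_mu, old_sigma = self.mu[task], self.sigma(task)
        # Adapt the statistics towards the new target.
        self.mu[task] = (1 - self.beta) * old_mu + self.beta * target
        self.nu[task] = (1 - self.beta) * self.nu[task] + self.beta * target ** 2
        new_sigma = self.sigma(task)
        # Rescale the head so value(task, features) is preserved for every input.
        scale = old_sigma / new_sigma
        self.w[task] = [wi * scale for wi in self.w[task]]
        self.b[task] = (old_sigma * self.b[task] + old_mu - self.mu[task]) / new_sigma
```

Because only the updated task's statistics and head change, a task with very large returns does not alter the scale of the normalised targets seen by the other tasks, which is the property the abstract describes as giving all tasks a similar impact on the learning dynamics.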

Paper Structure

This paper contains 25 sections, 12 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Summary of results: aggregate scores for IMPALA and PopArt-IMPALA. We report median human normalised score for Atari-57, and mean capped human normalised score for DmLab-30. In Atari, Random and Human refer to whether the trained agent is evaluated with random or human starts. In DmLab-30 the test score includes evaluation on the held-out levels.
  • Figure 1: Atari-57 (reward clipping). Median human normalised score across all Atari levels, as a function of the total number of frames seen by the agents across all levels. We compare PopArt-IMPALA to IMPALA and to an additional baseline, MultiHead-IMPALA, that uses task-specific value predictions but no adaptive normalisation. All three agents are trained with the clipped reward scheme.
  • Figure 2: Atari-57 (unclipped): Median human normalised score across all Atari levels, as a function of the total number of frames seen by the agents across all levels. We here compare the same set of agents as in Figure 1, but now all agents are trained without using reward clipping. The approximately flat lines corresponding to the baselines mean no learning at all on at least 50% of the games.
  • Figure 3: Normalisation statistics: Top: learned statistics, without reward clipping, for four distinct Atari games. The shaded region is $[\mu - \sigma, \mu + \sigma]$. Bottom: undiscounted returns.
  • Figure 4: DmLab-30. Mean capped human normalised score of IMPALA (blue) and PopArt-IMPALA (orange), across the DmLab-30 benchmark as a function of the number of frames (summed across all levels). Shaded region is bounded by the best and worst runs among 3 PBT experiments. For reference, we also plot the performance of IMPALA with the limited action set from the original paper (dashed).
  • ...and 5 more figures
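
The captions above report two aggregate metrics: a median human normalised score for Atari-57 and a mean capped human normalised score for DmLab-30. The sketch below shows how such aggregates are commonly computed, assuming the usual convention that a random agent maps to 0 and a human to 100. The function names and the `(agent, random, human)` tuple layout are hypothetical, and the exact reference scores used in the paper are not reproduced here.

```python
from statistics import mean, median


def human_normalised(agent, random, human):
    # 0 corresponds to random play and 100 to the human reference score.
    return 100.0 * (agent - random) / (human - random)


def median_human_normalised(scores):
    # scores maps a level name to an (agent, random, human) tuple.
    return median([human_normalised(*s) for s in scores.values()])


def mean_capped_human_normalised(scores, cap=100.0):
    # Capping each level at the human score stops a few superhuman levels
    # from dominating the mean.
    return mean([min(human_normalised(*s), cap) for s in scores.values()])
```

The median is insensitive to how far the agent exceeds the human score on its best levels, while the capped mean rewards progress on every level up to human performance, which may be why the two benchmarks are summarised differently.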