Learning Massively Multitask World Models for Continuous Control

Nicklas Hansen, Hao Su, Xiaolong Wang

TL;DR

The study addresses the scalability gap in online reinforcement learning for continuous control by introducing MMBench, a 200-task, 10-domain benchmark, and Newt, a language-conditioned multitask world model that is pretrained on demonstrations and then fine-tuned through online interaction. By combining self-predictive latent dynamics with demonstrations, Newt improves data efficiency and multitask performance across many domains, adapts rapidly to unseen tasks, and supports open-loop control. The work provides empirical evidence that online, massively multitask RL with language grounding is feasible and beneficial, and it releases benchmarks, checkpoints, and code to the community. This advances the goal of generalist control agents that operate across varied embodiments and tasks with efficient training pipelines.

Abstract

General-purpose control demands agents that act across many tasks and embodiments, yet research on reinforcement learning (RL) for continuous control remains dominated by single-task or offline regimes, reinforcing a view that online RL does not scale. Inspired by the foundation model recipe (large-scale pretraining followed by light RL), we ask whether a single agent can be trained on hundreds of tasks with online interaction. To accelerate research in this direction, we introduce a new benchmark with 200 diverse tasks spanning many domains and embodiments, each with language instructions, demonstrations, and optionally image observations. We then present Newt, a language-conditioned multitask world model that is first pretrained on demonstrations to acquire task-aware representations and action priors, and then jointly optimized with online interaction across all tasks. Experiments show that Newt yields better multitask performance and data-efficiency than a set of strong baselines, exhibits strong open-loop control, and enables rapid adaptation to unseen tasks. We release our environments, demonstrations, code for training and evaluation, as well as 200+ checkpoints.

Paper Structure

This paper contains 28 sections, 3 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Massively multitask RL. Average score when training a single agent via online interaction on 200 tasks spanning 10 task domains.
  • Figure 2: Tasks. Our proposed benchmark, MMBench, consists of 200 distinct tasks across 10 task domains, including 41 new tasks. See the appendix on environments for a detailed overview of our task set.
  • Figure 3: MiniArcade. We release a new task suite, dubbed MiniArcade, that consists of 22 tasks spanning 14 unique arcade-style environments (depicted). All tasks support both low-dimensional state representations and RGB observations, and have well-defined reward functions for RL.
  • Figure 4: Sample language instructions. All instructions in MMBench provide a description of embodiment and action space followed by a task description. Refer to the appendix on language instructions for more samples.
  • Figure 5: Language embeddings. First 2 principal components of CLIP-ViT/B embeddings shown for a subset of tasks.
  • ...and 8 more figures