Learning Massively Multitask World Models for Continuous Control
Nicklas Hansen, Hao Su, Xiaolong Wang
TL;DR
The study addresses the scalability gap in online reinforcement learning for continuous control by introducing MMBench, a 200-task, 10-domain benchmark, and Newt, a language-conditioned multitask world model pretrained on demonstrations and fine-tuned via online learning. By leveraging self-predictive latent dynamics and demonstrations, Newt achieves better data efficiency and multitask performance than the baselines it is compared against, while enabling rapid adaptation to unseen tasks and open-loop control. The work provides empirical evidence that online, massively multitask RL with language conditioning is feasible and beneficial, and it contributes benchmark environments, demonstrations, checkpoints, and code to the community. This advances the goal of generalist control agents that operate across varied embodiments and tasks with efficient training pipelines.
Abstract
General-purpose control demands agents that act across many tasks and embodiments, yet research on reinforcement learning (RL) for continuous control remains dominated by single-task or offline regimes, reinforcing a view that online RL does not scale. Inspired by the foundation model recipe (large-scale pretraining followed by light RL), we ask whether a single agent can be trained on hundreds of tasks with online interaction. To accelerate research in this direction, we introduce a new benchmark with 200 diverse tasks spanning many domains and embodiments, each with language instructions, demonstrations, and optionally image observations. We then present Newt, a language-conditioned multitask world model that is first pretrained on demonstrations to acquire task-aware representations and action priors, and then jointly optimized with online interaction across all tasks. Experiments show that Newt yields better multitask performance and data efficiency than a set of strong baselines, exhibits strong open-loop control, and enables rapid adaptation to unseen tasks. We release our environments, demonstrations, code for training and evaluation, as well as 200+ checkpoints.
