Scaling Tasks, Not Samples: Mastering Humanoid Control through Multi-Task Model-Based Reinforcement Learning

Shaohuai Liu; Weirui Ye; Yilun Du; Le Xie

TL;DR

The authors argue that effective online learning should scale the number of tasks rather than the number of samples per task. This regime reveals a structural advantage of model-based reinforcement learning (MBRL) and establishes task scaling as a critical axis for scalable robotic learning.

Abstract

Developing generalist robots capable of mastering diverse skills remains a central challenge in embodied AI. While recent progress emphasizes scaling model parameters and offline datasets, such approaches are limited in robotics, where learning requires active interaction. We argue that effective online learning should scale the number of tasks, rather than the number of samples per task. This regime reveals a structural advantage of model-based reinforcement learning (MBRL). Because physical dynamics are invariant across tasks, a shared world model can aggregate multi-task experience to learn robust, task-agnostic representations. In contrast, model-free methods suffer from gradient interference when tasks demand conflicting actions in similar states. Task diversity therefore acts as a regularizer for MBRL, improving dynamics learning and sample efficiency. We instantiate this idea with EfficientZero-Multitask (EZ-M), a sample-efficient multi-task MBRL algorithm for online learning. Evaluated on HumanoidBench, a challenging whole-body control benchmark, EZ-M achieves state-of-the-art performance with significantly higher sample efficiency than strong baselines, without extreme parameter scaling. These results establish task scaling as a critical axis for scalable robotic learning. The project website is available at https://yewr.github.io/ez_m/.

Paper Structure (20 sections, 3 theorems, 12 equations, 7 figures, 4 tables)

Introduction
Related Work
Preliminaries
Methods
Model, Search and Learning
Path Consistency
Multi-Task Design Choices
Theoretical Analysis
Problem Formulation
Interference vs. Invariance
Sample Complexity Analysis
Experiments
Experimental settings
Performance Analysis & Knowledge Sharing
Performance scaling with the number of tasks
...and 5 more sections

Key Result

Proposition 5.1

Let $\theta$ be the shared parameters of a model-free policy. If the optimal policies for distinct tasks $\tau_i, \tau_j$ require disjoint actions in the same latent state, then the expected cosine similarity of their gradients is negative.
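The interference described in this proposition can be illustrated with a toy calculation. The following is a minimal sketch, not the paper's setup: it assumes a shared linear softmax policy initialized to zero, a single shared state, and a negative log-likelihood loss toward each task's preferred action. The names `policy_gradient`, `cosine_similarity`, and `NUM_ACTIONS` are hypothetical and chosen only for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient(weights, state, target_action):
    """Gradient of -log pi(target_action | state) w.r.t. the flattened weights.

    The policy is a linear softmax: logits[a] = dot(weights[a], state).
    """
    logits = [sum(w * s for w, s in zip(row, state)) for row in weights]
    probs = softmax(logits)
    grad = []
    for action, prob in enumerate(probs):
        error = prob - (1.0 if action == target_action else 0.0)
        grad.extend(error * s for s in state)
    return grad


def cosine_similarity(u, v):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


NUM_ACTIONS = 4                      # assumed size of a discrete action set
state = [0.5, -1.0, 2.0]             # one state shared by both tasks
weights = [[0.0] * len(state) for _ in range(NUM_ACTIONS)]  # shared parameters

# Task i prefers action 0 and task j prefers action 1 in the same state.
grad_task_i = policy_gradient(weights, state, target_action=0)
grad_task_j = policy_gradient(weights, state, target_action=1)
similarity = cosine_similarity(grad_task_i, grad_task_j)
print(f"gradient cosine similarity: {similarity:.4f}")
```

Under these assumptions the similarity works out to $-1/(A-1)$ for $A$ actions, so the two task gradients pull the shared parameters in partly opposing directions. For other parameter values the sign can differ, which is consistent with the proposition being stated in expectation.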

Figures (7)

Figure 1: Normalized task-average scores on HumanoidBench-Hard. With environment interactions limited to 1 million, EZ-M significantly outperforms all strong baselines. All runs use 3 random seeds.
Figure 2: Overview of EfficientZero-Multitask (EZ-M). Green arrows and notations represent model inference and predictions. Black notations and bold arrows indicate target/grounded values and transitions. (A) Task-sharing model architecture. EZ-M uses a shared model to predict policy, value, and reward while training on multiple tasks, with each component conditioned on task embeddings. (B) Balanced data collection. Rollout workers sample tasks with a balanced schedule and execute the corresponding task policy to interact with the environment. This prevents learning collapse from data imbalance and maintains diverse task coverage over time. (C) Multi-task rollout learning. For each sampled trajectory segment, the model is unrolled in latent space conditioned on the task index to produce $\hat{r}_{\tau,k}$, $\hat{v}_{\tau,k}$, and $\hat{p}_{\tau,k}$ along the imagined rollout. Supervision comes from grounded transitions $(a_{\tau,k}, r_{\tau,k})$ and reanalyzed targets $(\pi_{\tau,k}, v_{\tau,k})$, where a cross-entropy loss enables a single network to learn consistent predictions across tasks. Blue dashed lines represent temporal consistency and path consistency. (D) Distributed implementation. Data collection (Rollout) feeds trajectories into task-independent replay buffers, while Reanalyze periodically recomputes search-based targets with the latest model. The learner consumes reanalyzed batches to update parameters and broadcasts updated weights back to the rollout and reanalyze workers for scalable asynchronous training.
Figure 3: Gradient similarities over the training process. We compare the gradient similarities of different model modules between BRC and EZ-M. We choose two task pairs, (h1-walk, h1-run) and (h1-walk, h1-crawl), representing relevant and irrelevant tasks, respectively. dyn, rew, and vp denote the dynamics, reward, and value-policy models in EZ-M. (Left) Gradient similarities for the relevant task pair are higher than those for the irrelevant pair throughout training, indicating positive knowledge transfer. (Right) EZ-M modules have higher gradient similarities than BRC modules on the relevant task pair. Curves are slightly smoothed for better visualization.
Figure 4: Performance scaling with the number of tasks. The y-axis denotes the task-average, normalized episodic return and the x-axis represents the number of tasks trained simultaneously. We use 3 different random seeds and report 95% confidence intervals.
Figure 5: Ablations on model components. The y-axis denotes the task-average, normalized episodic return and the x-axis represents different settings. $-$ represents removing a component. TE represents task embedding. Dyn, Rew, VP, and Rep represent the dynamics, reward, value-policy, and representation models, respectively. -Dyn-Rew-VP_TE indicates removing the task embedding from the dynamics, reward, and value-policy models. IER represents independent experience replay. PathCons is the path consistency loss. We use 3 different random seeds and report 95% confidence intervals.
...and 2 more figures
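The balanced data collection and per-task replay described in the Figure 2 caption (panels B and D) can be sketched roughly as follows. This is an illustration under stated assumptions, not the EZ-M implementation: the caption does not specify the schedule, so a round-robin schedule and equal per-task batch shares are assumed here, and the names `PerTaskReplay` and `sample_balanced` are hypothetical.

```python
import random
from collections import deque
from itertools import cycle


class PerTaskReplay:
    """One bounded buffer per task, so no task can crowd out another."""

    def __init__(self, task_names, capacity=1000):
        self.buffers = {name: deque(maxlen=capacity) for name in task_names}

    def add(self, task, transition):
        self.buffers[task].append(transition)

    def sample_balanced(self, per_task, rng):
        """Draw the same number of transitions from every non-empty buffer."""
        batch = []
        for task, buffer in self.buffers.items():
            if buffer:
                # Sampling with replacement keeps the share equal even when
                # a buffer holds fewer than `per_task` transitions.
                picks = rng.choices(list(buffer), k=per_task)
                batch.extend((task, item) for item in picks)
        return batch


tasks = ["h1-walk", "h1-run", "h1-crawl"]   # task names taken from Figure 3
replay = PerTaskReplay(tasks)

# Assumed round-robin schedule: each rollout step is assigned to the next task.
schedule = cycle(tasks)
for step in range(30):
    task = next(schedule)
    replay.add(task, {"step": step})        # placeholder for a real transition

rng = random.Random(0)
batch = replay.sample_balanced(per_task=4, rng=rng)
counts = {t: sum(1 for name, _ in batch if name == t) for t in tasks}
print(counts)
```

Keeping the buffers separate means the batch composition is controlled by the sampler rather than by how much data each task happens to have produced, which is one plausible way to avoid the data imbalance the caption mentions.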

Theorems & Definitions (4)

Proposition 5.1: Gradient Interference in MF (Yu et al., 2020)
Lemma 5.2: Dynamics Invariance
Theorem 5.3: Asymptotic Task Scaling Efficiency
Proof