One Model for All Tasks: Leveraging Efficient World Models in Multi-Task Planning

Yuan Pu, Yazhe Niu, Jia Tang, Junyu Xiong, Shuai Hu, Hongsheng Li

TL;DR

This work systematically investigates key architectural designs for extending UniZero and identifies a Mixture-of-Experts (MoE) architecture as the most effective approach, and introduces an online Dynamic Parameter Scaling (DPS) strategy to dynamically allocate model capacity throughout the learning process.

Abstract

In heterogeneous multi-task decision-making, tasks not only exhibit diverse observation and action spaces but also vary substantially in their underlying complexities. While conventional multi-task world models like UniZero excel in single-task settings, we find that when handling a broad and diverse suite of tasks, gradient conflicts and the loss of model plasticity often constrain their sample efficiency. In this work, we address these challenges from two complementary perspectives: the single learning iteration and the overall learning process. First, to mitigate the gradient conflicts, we systematically investigate key architectural designs for extending UniZero. Our investigation identifies a Mixture-of-Experts (MoE) architecture as the most effective approach. We demonstrate, both theoretically and empirically, that this architecture alleviates gradient conflicts by routing task-specific representations to specialized sub-networks. This finding leads to our proposed model, ScaleZero. Second, to dynamically allocate model capacity throughout the learning process, we introduce an online Dynamic Parameter Scaling (DPS) strategy. This strategy progressively integrates LoRA adapters in response to task-specific progress, enabling adaptive knowledge retention and parameter expansion. Evaluations on a diverse set of standard benchmarks (Atari, DMC, Jericho) demonstrate that ScaleZero, utilizing solely online reinforcement learning with one model, performs on par with specialized single-task agents. With the DPS strategy, it remains competitive while using just 71.5% of the environment interactions. These findings underscore the potential of ScaleZero for effective multi-task planning. Our code is available at https://github.com/opendilab/LightZero.
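The abstract describes DPS only at a high level: LoRA adapters are added progressively as tasks make progress. The following is a minimal sketch of that idea, not the authors' implementation. The class name `StagedLoRALinear`, the helper `should_expand`, the solved-fraction trigger, and the choice to freeze earlier adapters are all assumptions made for illustration. The only standard LoRA detail relied on is that the up-projection `B` starts at zero, so a newly added adapter leaves the layer output unchanged.

```python
import random


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


class StagedLoRALinear:
    """Hypothetical linear layer whose capacity grows by adding LoRA stages."""

    def __init__(self, weight, rank=2, seed=0):
        self.weight = weight      # base weight, treated as frozen in this sketch
        self.rank = rank
        self.adapters = []        # entries are (A, B, trainable)
        self._rng = random.Random(seed)

    def add_stage(self):
        out_dim, in_dim = len(self.weight), len(self.weight[0])
        # Assumption: earlier adapters are frozen to retain what they learned.
        self.adapters = [(A, B, False) for A, B, _ in self.adapters]
        A = [[self._rng.gauss(0.0, 0.01) for _ in range(in_dim)]
             for _ in range(self.rank)]
        B = [[0.0] * self.rank for _ in range(out_dim)]  # zero init: no output change
        self.adapters.append((A, B, True))

    def forward(self, x):
        y = matvec(self.weight, x)
        for A, B, _ in self.adapters:
            delta = matvec(B, matvec(A, x))  # low-rank update B(Ax)
            y = [yi + di for yi, di in zip(y, delta)]
        return y


def should_expand(num_solved, num_tasks, threshold):
    """Hypothetical trigger: expand once the solved fraction reaches a threshold."""
    return num_solved / num_tasks >= threshold


layer = StagedLoRALinear(weight=[[1.0, 2.0], [3.0, 4.0]])
x = [0.5, -1.0]
before = layer.forward(x)
if should_expand(num_solved=2, num_tasks=8, threshold=0.25):
    layer.add_stage()
after = layer.forward(x)
```

Because `B` is initialized to zero, `before` and `after` are identical here; the new adapter only changes the output once it has been trained.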

Paper Structure

This paper contains 73 sections, 5 theorems, 53 equations, 22 figures, 13 tables, 2 algorithms.

Key Result

Theorem 5.1

In a Mixture-of-Experts (MoE) layer, consider two tasks $t_1$ and $t_2$ with routing weights $\lambda^{t_1}_m, \lambda^{t_2}_m$ over $M$ experts, and per-task gradients on expert $m$, denoted as $g_{t_1}^{(m)}$ and $g_{t_2}^{(m)}$. Let $G$ be the maximum gradient conflict on any single expert, and $

Figures (22)

  • Figure 1: Plasticity collapse in the baseline (UniZero) on a multitask Atari benchmark. While simple tasks like Pong and Hero show stable learning, complex tasks such as Seaquest and ChopperCommand suffer a catastrophic performance collapse in later training (Top). This failure is precisely correlated with a sharp spike in the dormant neuron ratio of the transformer (Bottom Left) and an uncontrolled inflation of the latent state norm (Bottom Right), empirically validating the link between external performance and internal learning dynamics.
  • Figure 2: (a) A systematic exploration of the UniZero design space across five axes: task conditioning, encoder architecture, latent normalization, backbone design, and optimization. This investigation informs the design of our proposed ScaleZero model. (b) A conceptual diagram of Dynamic Parameter Scaling (DPS). DPS progressively expands model capacity by injecting LoRA adapters in stages, triggered by learning progress. This creates a curriculum over model capacity, directing resources toward unsolved tasks while preserving existing knowledge.
  • Figure 3: Performance impact of architectural modifications on the Atari8 multitask benchmark. This ablation across the UniZero design space reveals that replacing the Transformer backbone with a Mixture-of-Experts architecture yields the most significant and consistent performance gains. In contrast, other interventions, with the partial exception of SimNorm, provide marginal or inconsistent benefits. These results underscore the centrality of the MoE's conditional computation in overcoming the limitations of a shared, dense backbone.
  • Figure 4: Interaction cost comparison for ScaleZero vs. ScaleZero-DPS on DMControl. The latter reaches the target performance with a 28.5% reduction in environment interaction cost. Detailed curves are in Appendix \ref{app:dmc_exp}.
  • Figure 5: Analysis of representation effective rank. This figure supplements the diagnosis in Figure \ref{fig:plasticity_loss}, showing the relationship between game return (Left) and representation effective rank (Right) for the baseline model. The sharp drop in performance correlates strongly with a decline in effective rank, substantiating the claim that performance failure is linked to a collapse in the dimensionality of the model's representation space.
  • ...and 17 more figures
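Figure 1 tracks the dormant neuron ratio as a plasticity diagnostic. This page does not say how the ratio is computed, so the sketch below assumes the commonly used definition: a neuron is dormant when its mean absolute activation, divided by the layer-wide average of that quantity, is at most a threshold `tau`. The function name, the threshold value, and the example activations are illustrative assumptions.

```python
def dormant_ratio(activations, tau=0.025):
    """Fraction of neurons whose normalized activation score is <= tau.

    activations: list of rows, one row per input, one value per neuron.
    """
    n_inputs = len(activations)
    n_neurons = len(activations[0])
    # Mean absolute activation of each neuron over the batch of inputs.
    mean_abs = [sum(abs(row[j]) for row in activations) / n_inputs
                for j in range(n_neurons)]
    layer_mean = sum(mean_abs) / n_neurons
    if layer_mean == 0.0:
        return 1.0  # every neuron is silent
    # Normalize by the layer average so the score is scale-free.
    scores = [m / layer_mean for m in mean_abs]
    return sum(1 for s in scores if s <= tau) / n_neurons


# Two inputs, four neurons: neurons 1 and 3 are (almost) silent.
example = [
    [1.0, 0.0, 2.0, 0.00],
    [3.0, 0.0, 2.0, 0.01],
]
ratio = dormant_ratio(example)
```

On this toy batch two of the four neurons fall below the threshold, so `ratio` is 0.5. A sharp rise in this quantity during training is the kind of signal the figure caption refers to.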

Theorems & Definitions (10)

  • Theorem 5.1: Upper Bound on Gradient Conflict in MoE Layers (informal)
  • Theorem E.1: Upper Bound of Full-layer MoE Gradient Conflict with Sparse/Soft Routing
  • proof
  • Remark 1: Effect of Routing Strategies on Full-layer Gradient Conflict
  • Theorem E.2: Upper Bound on Single-Expert and Full-Layer MoE Gradient Conflict with Uniform Sparse Routing
  • proof
  • Theorem E.3: Expected Gradient Conflict on Task-specific Expert Sets
  • proof
  • Corollary E.4: Expected Full-layer MoE Gradient Conflict for $K$ Tasks
  • proof