MetaWorld-X: Hierarchical World Modeling via VLM-Orchestrated Experts for Humanoid Loco-Manipulation

Yutong Shen, Hangxu Liu, Penghui Liu, Jiashuo Luo, Yongkang Zhang, Rex Morvley, Chen Jiang, Jianwei Zhang, Lei Zhang

TL;DR

MetaWorld-X is a hierarchical world model framework for humanoid control. It decomposes complex control problems into a set of Specialized Expert Policies (SEP) and composes them through an Intelligent Routing Mechanism (IRM) supervised by a Vision-Language Model (VLM), enabling semantics-driven expert composition.

Abstract

Learning natural, stable, and compositionally generalizable whole-body control policies for humanoid robots performing simultaneous locomotion and manipulation (loco-manipulation) remains a fundamental challenge in robotics. Existing reinforcement learning approaches typically rely on a single monolithic policy to acquire multiple skills, which often leads to cross-skill gradient interference and motion pattern conflicts in high-degree-of-freedom systems. As a result, generated behaviors frequently exhibit unnatural movements, limited stability, and poor generalization to complex task compositions. To address these limitations, we propose MetaWorld-X, a hierarchical world model framework for humanoid control. Guided by a divide-and-conquer principle, our method decomposes complex control problems into a set of specialized expert policies (Specialized Expert Policies, SEP). Each expert is trained under human motion priors through imitation-constrained reinforcement learning, introducing biomechanically consistent inductive biases that ensure natural and physically plausible motion generation. Building upon this foundation, we further develop an Intelligent Routing Mechanism (IRM) supervised by a Vision-Language Model (VLM), enabling semantic-driven expert composition. The VLM-guided router dynamically integrates expert policies according to high-level task semantics, facilitating compositional generalization and adaptive execution in multi-stage loco-manipulation tasks.
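
The abstract says the router "dynamically integrates expert policies" but does not give the integration rule. The sketch below is one common way to realize this, a softmax-weighted blend of expert actions; it is an assumption for illustration, not the paper's confirmed method. The names `compose_action`, `walk_expert`, and `stand_expert` are hypothetical, and the router logits are passed in by hand rather than produced by a learned, VLM-supervised router.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax: turns router logits into mixing weights."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def compose_action(
    observation: Vector,
    experts: Sequence[Callable[[Vector], Vector]],
    router_logits: Sequence[float],
) -> Vector:
    """Blend expert actions as a convex combination (assumed composition rule)."""
    weights = softmax(router_logits)
    actions = [expert(observation) for expert in experts]
    dim = len(actions[0])
    return [sum(w * a[j] for w, a in zip(weights, actions)) for j in range(dim)]


# Hypothetical stand-in experts; real experts would be trained policy networks.
def walk_expert(obs: Vector) -> Vector:
    return [1.0, 0.0]


def stand_expert(obs: Vector) -> Vector:
    return [0.0, 1.0]


# Equal logits give equal weight to both experts.
blended = compose_action([0.0], [walk_expert, stand_expert], [0.0, 0.0])
```

A soft blend keeps transitions between skills smooth; a hard top-1 selection would be an equally plausible reading of the abstract.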

Paper Structure (18 sections, 13 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: MetaWorld-X is a hierarchical world model framework for whole-body humanoid control, featuring a Specialized Expert Policy (SEP) module for skill decomposition and an Intelligent Routing Mechanism (IRM) guided by a Vision-Language Model (VLM). By orchestrating a Mixture-of-Experts (MoE) architecture, it decomposes complex loco-manipulation tasks into human-motion-informed primitives and dynamically composes them via semantic routing.
  • Figure 2: MetaWorld-X achieves natural humanoid control through the dynamic orchestration of expert policies guided by a Vision-Language Model (VLM). The framework consists of two core modules: the Skill Expert Pool (SEP) trains specialized policy networks for fundamental motor skills using human motion data, while the Intelligent Router Module (IRM) employs a VLM as a supervisory teacher. The VLM guides the router’s training via few-shot inference, enabling a seamless transition from supervised learning to autonomous operation.
  • Figure 3: Architecture of the SEP module. We project human motion priors to the robot's configuration space via operator $\mathcal{M}$, then utilize alignment operator $\mathcal{A}$ to compute tracking signals for expert policies $\pi_{\theta_i}$. By optimizing $\mathcal{J}_{\text{SEP}}$ with a dynamic weighting mechanism that prioritizes poorly-tracked joints, the module yields the converged expert policies.
  • Figure 4: Example of VLM prompt for Semantic Router.
  • Figure 5: We evaluate four foundational motor skills (Walking, Running, Standing, and Sitting) after 500k training steps. The dashed lines in the learning curves indicate the approximate reward thresholds associated with successful execution of each skill.
  • ...and 4 more figures
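
The Figure 3 caption mentions a dynamic weighting mechanism that prioritizes poorly-tracked joints, but the exact form of $\mathcal{J}_{\text{SEP}}$ is not given in this summary. As a minimal sketch under stated assumptions, the code below weights each joint's squared tracking error by a softmax over those errors, so joints that track the reference worse contribute more to the loss. The function names, the softmax form, and the `temperature` parameter are all hypothetical.

```python
import math
from typing import List, Sequence


def joint_weights(errors: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax over per-joint errors: larger error gives larger weight (assumed form)."""
    scaled = [e / temperature for e in errors]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [x / total for x in exps]


def weighted_tracking_loss(
    robot_q: Sequence[float],
    reference_q: Sequence[float],
    temperature: float = 1.0,
) -> float:
    """Per-joint squared tracking error, re-weighted toward poorly-tracked joints."""
    errors = [(q - r) ** 2 for q, r in zip(robot_q, reference_q)]
    weights = joint_weights(errors, temperature)
    return sum(w * e for w, e in zip(weights, errors))


# Toy example: joint 2 is far from its reference, joint 0 tracks perfectly.
robot_q = [0.0, 0.5, 1.0]
reference_q = [0.0, 0.4, 0.0]
errors = [(q - r) ** 2 for q, r in zip(robot_q, reference_q)]
weights = joint_weights(errors)
loss = weighted_tracking_loss(robot_q, reference_q)
uniform_loss = sum(errors) / len(errors)
```

In this toy example the weighted loss exceeds the uniform average because the worst-tracked joint dominates. In a gradient-based training loop the weights would typically be treated as constants (no gradient through them), which is a further assumption beyond what the caption states.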