Hierarchical World Models as Visual Whole-Body Humanoid Controllers
Nicklas Hansen, Jyothir S, Vlad Sobal, Yann LeCun, Xiaolong Wang, Hao Su
TL;DR
Puppeteer introduces a two-agent hierarchical world-model framework for visual whole-body humanoid control, enabling data-driven learning without reward shaping or skill primitives. The low-level tracker, pretrained on a large MoCap dataset, can be reused across tasks, while the high-level puppeteer plans from visual observations and generates end-effector commands for the tracker to execute. Using TD-MPC2-based planning and termination-aware modeling, Puppeteer achieves strong performance across 8 challenging tasks and produces motions that human evaluators rate as more natural. Ablations show that planning is necessary at both levels and that diverse MoCap data is beneficial, with promising generalization to longer gaps. The work advances naturalistic, vision-based control for high-dimensional humanoids and provides a new benchmark for visual whole-body locomotion.
Abstract
Whole-body control for humanoids is challenging due to the high-dimensional nature of the problem, coupled with the inherent instability of a bipedal morphology. Learning from visual observations further exacerbates this difficulty. In this work, we explore highly data-driven approaches to visual whole-body humanoid control based on reinforcement learning, without any simplifying assumptions, reward design, or skill primitives. Specifically, we propose a hierarchical world model in which a high-level agent generates commands based on visual observations for a low-level agent to execute, both of which are trained with rewards. Our approach produces highly performant control policies in 8 tasks with a simulated 56-DoF humanoid, while synthesizing motions that are broadly preferred by humans.
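To make the two-level structure concrete, the following is a minimal sketch of one hierarchical control step. It is an illustration only, not the authors' implementation: the class names, the 64x64 RGB image size, the proprioceptive dimension, and the 15-dimensional command (assumed here to be 5 end-effectors with 3 coordinates each) are hypothetical, and both agents are replaced by simple stand-ins (a random command generator and a fixed linear map) where the actual method would plan with learned world models. The action dimension is set to 56 on the assumption that it matches the humanoid's degrees of freedom.

```python
import numpy as np

# All dimensions below are illustrative assumptions, not values from the paper,
# except ACTION_DIM, which is assumed to equal the 56 DoF stated in the abstract.
IMAGE_SHAPE = (64, 64, 3)
PROPRIO_DIM = 32
COMMAND_DIM = 15   # hypothetical: 5 end-effectors x 3 coordinates
ACTION_DIM = 56


class HighLevelAgent:
    """Stand-in for the high-level agent: maps observations to a command."""

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def act(self, image, proprio):
        # A real agent would plan with a learned world model.
        # Here a random command is sampled as a placeholder.
        return self.rng.uniform(-1.0, 1.0, size=COMMAND_DIM)


class LowLevelTracker:
    """Stand-in for the low-level agent: maps (proprio, command) to an action."""

    def __init__(self, seed=1):
        rng = np.random.default_rng(seed)
        self.weights = rng.normal(scale=0.1, size=(ACTION_DIM, PROPRIO_DIM + COMMAND_DIM))

    def act(self, proprio, command):
        # Placeholder policy: a fixed linear map squashed into [-1, 1].
        features = np.concatenate([proprio, command])
        return np.tanh(self.weights @ features)


def hierarchical_step(high, low, image, proprio):
    """One control step: the high level commands, the low level executes."""
    command = high.act(image, proprio)
    action = low.act(proprio, command)
    return command, action


# Usage with dummy observations.
image = np.zeros(IMAGE_SHAPE, dtype=np.float32)
proprio = np.zeros(PROPRIO_DIM, dtype=np.float32)
command, action = hierarchical_step(HighLevelAgent(), LowLevelTracker(), image, proprio)
```

Note that in this sketch only the high-level agent receives the image. The low-level tracker sees proprioceptive input and the command alone, which is one way a tracker could remain reusable across visually different tasks.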
