Hierarchical World Models as Visual Whole-Body Humanoid Controllers

Nicklas Hansen, Jyothir S, Vlad Sobal, Yann LeCun, Xiaolong Wang, Hao Su

TL;DR

Puppeteer is a two-agent hierarchical world-model framework for visual whole-body humanoid control that learns in a data-driven way, without reward shaping or skill primitives. A low-level tracker, pretrained on a large MoCap dataset, is reused across tasks, while a high-level puppeteer plans from visual observations to generate end-effector commands for the tracker. Using TD-MPC2-based planning and termination-aware modeling, Puppeteer achieves strong performance on 8 challenging tasks and produces motions that human evaluators rate as more natural. Ablations indicate that planning at both levels is necessary and that diverse MoCap data helps, and the method shows promising generalization to longer gaps. The work advances naturalistic, vision-based control for high-dimensional humanoids and provides a new benchmark for visual whole-body locomotion.

Abstract

Whole-body control for humanoids is challenging due to the high-dimensional nature of the problem, coupled with the inherent instability of a bipedal morphology. Learning from visual observations further exacerbates this difficulty. In this work, we explore highly data-driven approaches to visual whole-body humanoid control based on reinforcement learning, without any simplifying assumptions, reward design, or skill primitives. Specifically, we propose a hierarchical world model in which a high-level agent generates commands based on visual observations for a low-level agent to execute, both of which are trained with rewards. Our approach produces highly performant control policies in 8 tasks with a simulated 56-DoF humanoid, while synthesizing motions that are broadly preferred by humans.

Paper Structure (18 sections, 4 equations, 12 figures, 6 tables)

Figures (12)

  • Figure 1: Visual whole-body control for humanoids. We present Puppeteer, a hierarchical world model for humanoid control with visual observations. Our method produces natural and human-like motions without any reward design or skill primitives, and traverses challenging terrain.
  • Figure 2: Approach. We pretrain a tracking agent (world model) on human MoCap data using RL; this agent takes proprioceptive information $\mathbf{q}_{t}$ and an abstract reference motion (command) $\mathbf{c}_{t}$ as input, and synthesizes $H$ low-level actions that track the reference motion. We then train a high-level puppeteering agent on downstream tasks via online interaction; this agent takes both state $\mathbf{q}_{t}$ and visual information $\mathbf{v}_{t}$ as input, and outputs commands for the tracking agent to execute.
  • Figure 3: MoCap tracking. The low-level tracking agent is trained to track relative end-effector (head, hands, feet) positions of sampled reference motions in 3D space.
  • Figure 4: Tasks. We develop 5 visual whole-body humanoid control tasks with a 56-DoF simulated humanoid (bottom), as well as 3 non-visual tasks (top). See the appendix for more details.
  • Figure 5: Learning curves. Episode return vs. environment steps on all 8 tasks from our proposed task suite. Our method generally matches the return of TD-MPC2 on these tasks while producing more natural motions. We only evaluate SAC and DreamerV3 on proprioceptive tasks as they do not achieve any meaningful performance. Average of 10 random seeds; shaded areas are $95\%$ CIs.
  • ...and 7 more figures
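
The data flow described in the Figure 2 caption can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: both agents are replaced by trivial hypothetical stand-in functions, and the constants `COMMAND_DIM`, `ACTION_DIM`, and `HORIZON` are assumptions chosen for illustration (the source states only that the humanoid has 56 DoF and that the tracked end-effectors are the head, hands, and feet).

```python
from typing import List, Tuple

Vector = List[float]

# Illustrative assumptions, not values confirmed by the source:
ACTION_DIM = 56   # assumed: one action per degree of freedom of the 56-DoF humanoid
COMMAND_DIM = 15  # assumed: 5 end-effectors (head, hands, feet) x 3D position
HORIZON = 3       # assumed placeholder for the number of low-level actions H


def high_level_agent(q: Vector, v: Vector) -> Vector:
    """Hypothetical stand-in for the puppeteering agent: (q_t, v_t) -> command c_t."""
    signal = (sum(q) + sum(v)) / (len(q) + len(v))
    return [signal] * COMMAND_DIM


def low_level_agent(q: Vector, c: Vector, horizon: int = HORIZON) -> List[Vector]:
    """Hypothetical stand-in for the tracking agent: (q_t, c_t) -> H low-level actions."""
    mean_command = sum(c) / len(c)
    return [[mean_command] * ACTION_DIM for _ in range(horizon)]


def control_step(q: Vector, v: Vector) -> Tuple[Vector, List[Vector]]:
    """One hierarchical step: the high level issues a command, the low level tracks it."""
    c = high_level_agent(q, v)       # uses proprioceptive state and visual information
    actions = low_level_agent(q, c)  # uses proprioceptive state and the command only
    return c, actions


if __name__ == "__main__":
    command, actions = control_step(q=[0.0] * 4, v=[1.0] * 4)
    print(len(command), len(actions), len(actions[0]))
```

The point of the sketch is the interface: visual information enters only through the high-level agent, while the low-level agent sees proprioceptive state and the command, which is consistent with the tracker being reusable across tasks.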