Learning Multi-Modal Whole-Body Control for Real-World Humanoid Robots
Pranay Dugar, Aayam Shrestha, Fangzhou Yu, Bart van Marum, Alan Fern
TL;DR
The paper tackles universal, multi-modal whole-body control for humanoids by proposing the Masked Humanoid Controller (MHC), a single learned policy that can stand, walk, and mimic both full-body and partial-body motions specified through diverse input modalities. Trained in MuJoCo with a curriculum over masked directives, domain randomization, and a rich motion dataset, the MHC learns to output PD setpoints that track the directives while maintaining balance and robustness. Key contributions include a unified framework for multi-modal directives, a detailed description of data generation and architecture design, a curriculum-driven training strategy, and a demonstration of sim-to-real transfer on the Digit V3 robot, supported by ablations and generalization analyses. The work advances practical, versatile humanoid control by integrating multiple input modalities in one policy and showing that learned whole-body control transfers to real hardware.
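To make the phrase "PD setpoints" concrete, the following is a minimal sketch of how a low-level PD loop could turn policy-provided joint setpoints into torques. It assumes the textbook PD law with a zero target velocity; the function name, gains, and joint values are hypothetical and are not taken from the paper.

```python
def pd_torques(setpoints, positions, velocities, kp, kd):
    """Per-joint PD law: tau = kp * (q_target - q) - kd * q_dot.

    Assumption: target joint velocity is zero, as in a standard PD loop.
    """
    return [
        p * (q_target - q) - d * q_dot
        for q_target, q, q_dot, p, d in zip(setpoints, positions, velocities, kp, kd)
    ]


# Hypothetical two-joint example; all numbers are illustrative only.
setpoints = [1.0, -0.5]   # in the MHC setting these would come from the learned policy
positions = [0.5, -0.5]   # measured joint positions
velocities = [0.2, 0.0]   # measured joint velocities
torques = pd_torques(setpoints, positions, velocities, kp=[10.0, 10.0], kd=[1.0, 1.0])
```

In this arrangement the policy only chooses where each joint should go, and the PD loop handles the conversion to torque, typically at a higher rate than the policy itself.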
Abstract
The foundational capabilities of humanoid robots should include robust standing, walking, and mimicry of whole-body and partial-body motions. This work introduces the Masked Humanoid Controller (MHC), which supports all of these capabilities by tracking target trajectories over selected subsets of humanoid state variables while ensuring balance and robustness against disturbances. The MHC is trained in simulation using a carefully designed curriculum that imitates partially masked motions from a library of behaviors spanning standing, walking, optimized reference trajectories, re-targeted video clips, and human motion capture data. It also allows joystick-based control to be combined with partial-body motion mimicry. We showcase simulation experiments validating the MHC's ability to execute a wide variety of behaviors from partially specified target motions. Moreover, we demonstrate sim-to-real transfer on the real-world Digit V3 humanoid robot. To our knowledge, this is the first instance of a learned controller that can realize whole-body control of a real-world humanoid for such diverse multi-modal targets.
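The idea of tracking targets over a selected subset of state variables can be illustrated with a small sketch. It assumes a directive is encoded as a vector of target values plus a binary mask, and that tracking error is averaged over the unmasked entries only. The state variable names, the encoding, and the error measure are hypothetical choices for illustration, not the paper's actual command space or reward.

```python
# Hypothetical state variables; the real MHC command space is not specified here.
STATE_KEYS = [
    "base_vx",
    "base_yaw_rate",
    "left_shoulder_pitch",
    "right_shoulder_pitch",
    "left_knee",
]


def encode_directive(partial_targets):
    """Return (values, mask) over STATE_KEYS; unspecified entries get mask 0."""
    values = [partial_targets.get(key, 0.0) for key in STATE_KEYS]
    mask = [1.0 if key in partial_targets else 0.0 for key in STATE_KEYS]
    return values, mask


def masked_tracking_error(state, values, mask):
    """Mean squared error over the entries the directive actually specifies."""
    active = sum(mask)
    if active == 0:
        return 0.0  # nothing is commanded, so there is nothing to track
    squared = sum(
        m * (state[key] - v) ** 2 for key, v, m in zip(STATE_KEYS, values, mask)
    )
    return squared / active


# Combine a joystick-style base command with a partial upper-body motion target.
joystick = {"base_vx": 0.5, "base_yaw_rate": 0.0}
arm_clip = {"left_shoulder_pitch": 0.3, "right_shoulder_pitch": -0.3}
values, mask = encode_directive({**joystick, **arm_clip})
```

Here the knee is left unspecified, so its value does not contribute to the error and a controller would be free to choose it to maintain balance.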
