Expressive Whole-Body Control for Humanoid Robots

Xuxin Cheng, Yandong Ji, Junming Chen, Ruihan Yang, Ge Yang, Xiaolong Wang

TL;DR

The paper introduces ExBody, a goal-conditioned RL framework that enables a humanoid robot to perform expressive, diverse motions by learning from large-scale human mocap data while relaxing lower-body imitation. It combines data curation, motion retargeting to hardware, and a carefully designed reward structure to train a single policy that tracks root motion and upper-body expression, achieving robust sim-to-real transfer on a Unitree H1. Key contributions include (i) a data-driven retargeting and initialization pipeline, (ii) a dual-goal RL objective balancing expressivity with locomotion, and (iii) extensive simulated and real-world demonstrations of handshaking, dancing, and adaptive walking. This approach advances natural, interactive humanoid behavior and highlights the value of leveraging rich human motion datasets for real-world robotics.

Abstract

Can we enable humanoid robots to generate rich, diverse, and expressive motions in the real world? We propose to learn a whole-body control policy on a human-sized robot to mimic human motions as realistically as possible. To train such a policy, we leverage the large-scale human motion capture data from the graphics community in a Reinforcement Learning framework. However, directly performing imitation learning with the motion capture dataset would not work on the real humanoid robot, given the large gap in degrees of freedom and physical capabilities. Our method, Expressive Whole-Body Control (ExBody), tackles this problem by encouraging the upper humanoid body to imitate a reference motion, while relaxing the imitation constraint on its two legs and only requiring them to follow a given velocity robustly. With training in simulation and Sim2Real transfer, our policy can control a humanoid robot to walk in different styles, shake hands with humans, and even dance with a human in the real world. We conduct extensive studies and comparisons on diverse motions in both simulation and the real world to show the effectiveness of our approach.
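The split described in the abstract (the upper body imitates a reference, the legs only follow a velocity) can be illustrated with a minimal reward sketch. This is not the paper's reward: the function names, the exponential tracking kernel, the L1 error, and the weights are all assumptions chosen for illustration.

```python
import numpy as np

def tracking_reward(error, scale=1.0):
    """Map a non-negative tracking error to (0, 1]; 1.0 means perfect tracking."""
    return float(np.exp(-scale * error))

def dual_goal_reward(q_upper, q_upper_ref, root_vel, root_vel_cmd,
                     w_expression=0.5, w_root=0.5):
    """Hypothetical two-term reward: upper-body imitation plus root velocity tracking.

    Leg joint angles deliberately do not appear here, mirroring the relaxed
    lower-body imitation constraint. Weights and kernel are assumptions.
    """
    # Expression term: upper-body joint angles vs. the reference motion.
    expr_err = np.sum(np.abs(np.asarray(q_upper) - np.asarray(q_upper_ref)))
    # Root movement term: actual root velocity vs. the commanded velocity.
    vel_err = np.sum(np.abs(np.asarray(root_vel) - np.asarray(root_vel_cmd)))
    return w_expression * tracking_reward(expr_err) + w_root * tracking_reward(vel_err)

# Perfect tracking of both goals yields the maximum reward (sum of the weights).
r_perfect = dual_goal_reward([0.1, -0.2], [0.1, -0.2], [0.5, 0.0], [0.5, 0.0])
# Deviating from the upper-body reference lowers the reward.
r_offset = dual_goal_reward([0.4, -0.2], [0.1, -0.2], [0.5, 0.0], [0.5, 0.0])
```

Because the leg joints are absent from both terms, any leg motion that achieves the commanded root velocity is rewarded equally, which is the intuition behind relaxing lower-body imitation.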

Paper Structure (16 sections, 1 equation, 8 figures, 9 tables)

Figures (8)

  • Figure 1: Overview of our framework. Our framework is able to train on data from various widely available sources, such as static human motion datasets, generative models, and video-to-pose models. After motion retargeting, we acquire a repertoire of motion clips that are compatible with our robot's kinematic structure. We extract the expression goal $\mathbf{g}^e$ and the root movement goal $\mathbf{g}^m$ from the rich features of the retargeted motion clips as the goals of our goal-conditioned RL objective. The root movement goal $\mathbf{g}^m$ can also be intuitively given by joystick commands, enabling convenient deployment in the real world.
  • Figure 2: Dataset visualization of our training data from CMU MoCap. We sample all the motion clips at increments of 1 s, which yields 1338 plotted data points. We can observe the bias in the distribution of human motions. Such distributions are shown to help policy learning in Sec. \ref{sec:results}.
  • Figure 3: Left: During training, we extract a large repertoire of retargeted motion clips and train our ExBody policy. Right: During deployment, we can replay motion that comes from a variety of sources, such as static motion datasets, diffusion models, or video-to-skeleton models. For Unitree H1, the robot we use, the shoulder and hip joints each have three perpendicular DoFs. Other joints have 1 DoF each. There are 19 DoFs in total. We also notice that some of the retargeted motions exhibit exaggerated movement of the robot's lower body, which is why we use ExBody to make them transferable.
  • Figure 4: Policy's state distribution under different sampling strategies. The green dots are the policy rollout's states. For dataset sampling, we record 20 data points for 4096 environments with randomly sampled arm trajectories from our training set. For random sampling, the red shade represents the randomly sampled $\mathbf{g}^m$ range. For yaw velocity, we do not sample the command, because the policy observes the difference between the desired and actual yaw, and does not explicitly track the angular velocity. The second peak in root height is the initialization bias.
  • Figure 5: We sample 10,000 points of hand positions relative to the robot. Left: retargeted motion dataset. Right: learned ExBody policy rollouts. The upper body movement from the dataset forms a natural distribution for learning.
  • ...and 3 more figures
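
The DoF count stated in the Figure 3 caption can be checked with a small bookkeeping sketch. The 3-DoF shoulders and hips and the total of 19 come from the caption; the names of the single-DoF joints (knees, ankles, elbows, torso) are assumptions used only to make the arithmetic concrete.

```python
# Joint groups for a Unitree H1-style humanoid.
# From the Figure 3 caption: shoulders and hips have 3 DoFs, others 1 each.
# Assumed for illustration: which single-DoF joints exist and their names.
JOINT_DOFS = {
    "left_hip": 3, "right_hip": 3,
    "left_knee": 1, "right_knee": 1,
    "left_ankle": 1, "right_ankle": 1,
    "torso": 1,
    "left_shoulder": 3, "right_shoulder": 3,
    "left_elbow": 1, "right_elbow": 1,
}

def total_dofs(joint_dofs):
    """Sum the degrees of freedom over all joint groups."""
    return sum(joint_dofs.values())

def single_dof_joints(joint_dofs):
    """Return the sorted names of joints that have exactly one DoF."""
    return sorted(name for name, n in joint_dofs.items() if n == 1)

# Four 3-DoF joints contribute 12 DoFs; the remaining single-DoF joints add the rest.
print(total_dofs(JOINT_DOFS), len(single_dof_joints(JOINT_DOFS)))
```

Under these assumptions, the four 3-DoF joints account for 12 DoFs, so the caption's total of 19 implies seven single-DoF joints.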