ExBody2: Advanced Expressive Humanoid Whole-Body Control

Mazeyu Ji, Xuanbin Peng, Fangchen Liu, Jialong Li, Ge Yang, Xuxin Cheng, Xiaolong Wang

TL;DR

Exbody2 tackles the challenge of expressive, robust humanoid whole-body control by introducing a sim-to-real framework with a generalist policy learned from retargeted diverse motion data and specialist policies finetuned for targeted motions. It couples automated data curation via a feasibility-diversity principle with a decoupled motion-velocity control strategy and a teacher-student RL training pipeline to enable deployable real-world performance. Key contributions include automated lower-body filtering to balance feasibility and diversity, a two-stage training paradigm, and a velocity-based global tracking approach that preserves expressive motion. Empirical results on a Unitree G1 demonstrate superior tracking fidelity and stability in both simulation and real-world tests, with specialist finetuning offering additional gains on challenging tasks and out-of-distribution (OOD) scenarios. These advances push humanoid expressiveness closer to human-level motion while maintaining robustness in real-world environments.

Abstract

This paper tackles the challenge of enabling real-world humanoid robots to perform expressive and dynamic whole-body motions while maintaining overall stability and robustness. We propose Advanced Expressive Whole-Body Control (Exbody2), a method for producing whole-body tracking controllers that are trained on both human motion capture and simulated data and then transferred to the real world. We introduce a technique for decoupling the velocity tracking of the entire body from the tracking of body landmarks. We use a teacher policy to produce intermediate data that better conforms to the robot's kinematics and to automatically filter out infeasible whole-body motions. This two-step approach enabled us to produce a student policy that can be deployed on the robot to walk, crouch, and dance. We also provide insight into the trade-off between versatility and tracking performance on specific motions. We observed a significant improvement in tracking performance on the targeted motions after fine-tuning on a small amount of data, at the expense of performance on the other motions.

Paper Structure

This paper contains 36 sections, 3 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Exbody2's framework. (a) Motion retargeting adapts raw human motion datasets to fit the humanoid robot's morphology, generating a diverse set of training samples. (b) Automated dataset filtering ranks motions based on tracking errors and selects an optimal subset to train a generalist policy, balancing feasibility and diversity. (c) Specialist policy finetuning refines the generalist model for specific motion categories, such as walking, dancing, and kung fu, improving precision for targeted tasks. (d) The trained policies are deployed on a real humanoid robot, demonstrating expressive, dynamic, and stable whole-body motions in real-world environments.
  • Figure 2: Teacher-student framework for humanoid motion learning, where the teacher uses privileged information, and the student learns from past observations to generate control actions.
  • Figure 3: Impact of dataset filtering thresholds on policy tracking errors. The figure shows the tracking error trends across different dataset filtering thresholds. Policies trained on datasets with filtering thresholds that balance diversity and stability (e.g., $\pi_{\tau=0.150}$) achieve the lowest tracking errors. The base policy exhibits suboptimal performance due to unfiltered data, while overly restrictive thresholds (e.g., $\pi_{\tau=0.075}$) and overly lenient thresholds (e.g., $\pi_{\tau=0.175}$) show reduced effectiveness. We compute the error metric $e(s) = \alpha\, E_{\text{key}}(s) + \beta\, E_{\text{dof}}(s)$ with $\alpha=0.1, \beta=0.9$, assigning heavier weight to the joint-angle term.
  • Figure 4: A sequence of a robot performing the Cha-Cha dance. From top to bottom: the reference motion represented by an avatar, our algorithm's performance in simulation, and its performance on a real robot. The bottom three rows show the per-frame errors: whole-body joint DoF error, upper-body joint DoF error, and lower-body joint DoF error. The blue curve represents the Exbody2-Specialist policy finetuned on $\mathcal{D}_{dancing}$, the orange curve the Exbody2-Scratch policy trained from scratch on $\mathcal{D}_{dancing}$, and the green curve our Exbody2-Generalist policy trained on filtered $\mathcal{D}_{CMU}$.
  • Figure 5: Illustration of ExBody2’s multi-source application, demonstrating how VR, RGB, motion capture, and generative models can be combined to produce diverse humanoid behaviors. (a) Motion Datasets: specialized policies (e.g., kung fu, dancing) finetuned on specialist motion datasets. (b) Real-time Whole-body Mimic: real-time replication of human motions from monocular RGB via HybrIK. (c) Motion Synthesis: a CVAE-based approach for extended and varied motion generation. Experiments demonstrate ExBody2’s capability to seamlessly integrate multiple motion sources in both simulation and real-world scenarios.
  • ...and 2 more figures
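
The error metric quoted in the Figure 3 caption, $e(s) = \alpha\, E_{\text{key}}(s) + \beta\, E_{\text{dof}}(s)$ with $\alpha=0.1, \beta=0.9$, can be illustrated with a minimal Python sketch. Only the formula and the two weights come from the caption. The `MotionScore` container, the function names, the sample error values, and the idea of keeping motions whose combined error is at most a threshold `tau` are assumptions made for illustration, not the authors' implementation; the caption does not say which quantity the filtering threshold is applied to.

```python
from dataclasses import dataclass

# Weights taken from the Figure 3 caption: the joint-angle (DoF) term dominates.
ALPHA = 0.1  # weight on the key-point error E_key
BETA = 0.9   # weight on the joint-angle error E_dof


@dataclass
class MotionScore:
    """Hypothetical container for the per-motion tracking errors of one clip."""
    name: str
    e_key: float  # key-point tracking error, E_key(s)
    e_dof: float  # joint-angle tracking error, E_dof(s)


def combined_error(m: MotionScore, alpha: float = ALPHA, beta: float = BETA) -> float:
    """e(s) = alpha * E_key(s) + beta * E_dof(s), as written in the caption."""
    return alpha * m.e_key + beta * m.e_dof


def filter_motions(motions: list[MotionScore], tau: float) -> list[MotionScore]:
    """Rank motions by combined error and keep those at or below tau.

    Assumption: thresholding on e(s) is an illustrative choice, not a
    statement of how the paper applies its filtering threshold.
    """
    ranked = sorted(motions, key=combined_error)
    return [m for m in ranked if combined_error(m) <= tau]


# Made-up error values, purely for demonstration.
motions = [
    MotionScore("walk", e_key=0.2, e_dof=0.1),
    MotionScore("backflip", e_key=0.5, e_dof=0.3),
    MotionScore("dance", e_key=0.3, e_dof=0.13),
]
kept = filter_motions(motions, tau=0.15)
print([m.name for m in kept])
```

With these invented numbers the hard-to-track clip is dropped while the two easier clips are kept, which mirrors the feasibility side of the feasibility-diversity trade-off described in the caption: a smaller `tau` keeps fewer, easier motions, and a larger one admits more diverse but harder ones.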