UniTracker: Learning Universal Whole-Body Motion Tracker for Humanoid Robots
Kangning Yin, Weishuai Zeng, Ke Fan, Minyue Dai, Zirui Wang, Qiang Zhang, Zheng Tian, Jingbo Wang, Jiangmiao Pang, Weinan Zhang
TL;DR
The paper tackles universal whole-body motion tracking for humanoid robots by introducing UniTracker, a three-stage framework that combines a teacher policy trained with privileged observations, a CVAE-based universal student policy, and a fast adaptation module. By modeling a structured latent space conditioned on future motion references, the CVAE enables diverse, globally coherent behaviors under partial observations and improves generalization to unseen motions. A lightweight residual decoder provides rapid, motion-specific adaptation for challenging sequences. The framework achieves robust sim-to-real transfer on a 29-DoF Unitree G1 and tracks over 8k diverse motions. Extensive simulation and real-world experiments demonstrate superior accuracy, robustness to observation noise, and applicability to downstream tasks such as text-to-motion generation and video-to-motion estimation. The work contributes a practical, scalable paradigm for expressive, general-purpose humanoid control that integrates data-efficient learning with modular adaptation.
Abstract
Achieving expressive and generalizable whole-body motion control is essential for deploying humanoid robots in real-world environments. In this work, we propose UniTracker, a three-stage training framework that enables robust and scalable motion tracking across a wide range of human behaviors. In the first stage, we train a teacher policy with privileged observations to generate high-quality actions. In the second stage, we introduce a Conditional Variational Autoencoder (CVAE) to model a universal student policy that can be deployed directly on real hardware. The CVAE structure allows the policy to learn a global latent representation of motion, enhancing generalization to unseen behaviors and addressing the limitations of standard MLP-based policies under partial observations. Unlike pure MLPs that suffer from drift in global attributes like orientation, our CVAE-student policy incorporates global intent during training by aligning a partial-observation prior to the full-observation encoder. In the third stage, we introduce a fast adaptation module that fine-tunes the universal policy on harder motion sequences that are difficult to track directly. This adaptation can be performed both for single sequences and in batch mode, further showcasing the flexibility and scalability of our approach. We evaluate UniTracker in both simulation and real-world settings using a Unitree G1 humanoid, demonstrating strong performance in motion diversity, tracking accuracy, and deployment robustness.
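The abstract describes aligning a partial-observation prior with a full-observation encoder, and adapting the universal policy with a residual module. The following is a minimal sketch of how such an objective could be written, not the authors' implementation. It assumes diagonal Gaussian latents, a mean-squared-error imitation term against the teacher's action, a KL weight `beta`, and a residual that is simply added to the base action. All function names and the value of `beta` are hypothetical.

```python
import math

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over latent dims.

    Here q stands for the full-observation encoder and p for the
    partial-observation prior (an assumed reading of the alignment step).
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total

def student_loss(student_action, teacher_action,
                 mu_q, logvar_q, mu_p, logvar_p, beta=0.01):
    """Hypothetical student objective: imitate the teacher, align prior to encoder."""
    # Mean squared error between student and teacher actions (assumed imitation term).
    recon = sum((s - t) ** 2 for s, t in zip(student_action, teacher_action))
    recon /= len(teacher_action)
    # The KL term pulls the deployable prior toward the privileged encoder.
    return recon + beta * kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)

def adapted_action(base_action, residual):
    """Hypothetical fast adaptation: add a motion-specific residual to the base action."""
    return [b + r for b, r in zip(base_action, residual)]

if __name__ == "__main__":
    # Toy 1-D latent: the encoder mean is offset by 1 from the prior mean.
    print(kl_diag_gaussians([1.0], [0.0], [0.0], [0.0]))
    print(student_loss([1.0, 0.0], [0.0, 0.0], [1.0], [0.0], [0.0], [0.0]))
    print(adapted_action([0.1, 0.2], [0.0, -0.2]))
```

In this sketch only the prior would be needed at deployment, since the encoder depends on observations that are unavailable on hardware. The residual leaves the universal policy's output untouched when it is zero, which is one plausible way to keep adaptation lightweight.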
