UniTracker: Learning Universal Whole-Body Motion Tracker for Humanoid Robots

Kangning Yin, Weishuai Zeng, Ke Fan, Minyue Dai, Zirui Wang, Qiang Zhang, Zheng Tian, Jingbo Wang, Jiangmiao Pang, Weinan Zhang

TL;DR

The paper tackles universal whole-body motion tracking for humanoid robots by introducing UniTracker, a three-stage framework that marries privileged-data teacher policies with a CVAE-based universal student and a fast adaptation module. By modeling a structured latent space conditioned on future motion references, the CVAE enables diverse, globally coherent behaviors under partial observations and improves generalization to unseen motions. A lightweight residual decoder provides rapid, motion-specific adaptation for challenging sequences, achieving robust sim-to-real transfer on a 29-DoF Unitree G1 and tracking over 8k diverse motions. Extensive simulations and real-world experiments demonstrate superior accuracy, robustness to observation noise, and applicability to downstream tasks like text-to-motion generation and video-to-motion estimation. The work contributes a practical, scalable paradigm for expressive, general-purpose humanoid control that integrates data-efficient learning with modular adaptation.

Abstract

Achieving expressive and generalizable whole-body motion control is essential for deploying humanoid robots in real-world environments. In this work, we propose UniTracker, a three-stage training framework that enables robust and scalable motion tracking across a wide range of human behaviors. In the first stage, we train a teacher policy with privileged observations to generate high-quality actions. In the second stage, we introduce a Conditional Variational Autoencoder (CVAE) to model a universal student policy that can be deployed directly on real hardware. The CVAE structure allows the policy to learn a global latent representation of motion, enhancing generalization to unseen behaviors and addressing the limitations of standard MLP-based policies under partial observations. Unlike pure MLPs that suffer from drift in global attributes like orientation, our CVAE-student policy incorporates global intent during training by aligning a partial-observation prior to the full-observation encoder. In the third stage, we introduce a fast adaptation module that fine-tunes the universal policy on harder motion sequences that are difficult to track directly. This adaptation can be performed both for single sequences and in batch mode, further showcasing the flexibility and scalability of our approach. We evaluate UniTracker in both simulation and real-world settings using a Unitree G1 humanoid, demonstrating strong performance in motion diversity, tracking accuracy, and deployment robustness.
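The alignment of a partial-observation prior to a full-observation encoder can be illustrated with a KL term between two diagonal Gaussians. The following is a minimal sketch, not the authors' implementation: the function names, the diagonal-Gaussian parameterization, the mean-squared action imitation term, and the `kl_weight` value are all assumptions made for illustration.

```python
import math


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) for diagonal Gaussians given per-dimension means and log-variances.

    Here q stands for the encoder (full observations) and p for the prior
    (partial observations); both roles are assumptions of this sketch.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def student_loss(action_student, action_teacher,
                 mu_q, logvar_q, mu_p, logvar_p, kl_weight=0.01):
    """Hypothetical distillation loss: action imitation plus weighted KL alignment."""
    # Mean-squared error between student and teacher actions (assumed imitation term).
    recon = sum((a - b) ** 2 for a, b in zip(action_student, action_teacher))
    recon /= len(action_teacher)
    # Pull the deployable prior toward the encoder that sees full observations.
    kl = kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)
    return recon + kl_weight * kl
```

Under this reading, only the prior would be needed at deployment time, since the encoder depends on observations that are unavailable on hardware; minimizing the KL term during training is what would let the prior carry the encoder's global information.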

Paper Structure

This paper contains 16 sections, 4 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: We deploy UniTracker on a real humanoid robot, enabling it to perform a diverse range of motions, including (1) squat, (2) golf, (3) high kick, (4) lateral step, (5) dance under external force, and (6) challenging motions via fast adaptation.
  • Figure 2: An overview of UniTracker: In Stage 1, we train a teacher policy using oracle states via goal-conditioned reinforcement learning. In Stage 2, we distill the policy into a deployable form using a CVAE-based DAgger framework. In Stage 3, we introduce a fast adaptation module for handling challenging motion sequences, implemented using a residual decoder. The training dataset is derived from the AMASS dataset, filtered by PHC to remove physically infeasible motions.
  • Figure 3: Outcomes of downstream applications in MuJoCo: we evaluate text-to-motion generation and video-based motion estimation in the MuJoCo simulator.
  • Figure 4: Generalization Ability and Global Consistency of UniTracker in the MuJoCo Simulator
  • Figure 5: Fast Adaptation to Challenging Motions
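
The Figure 2 caption describes the fast adaptation module as a residual decoder. One common way to realize that idea is to keep the universal decoder frozen and add a small trainable correction to its output. The sketch below assumes this additive form, a single linear residual layer, and zero initialization; none of these details are confirmed by the text above, and the class and function names are hypothetical.

```python
def linear(weights, bias, x):
    """Apply a dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


class ResidualAdapter:
    """Hypothetical adapter: frozen base decoder plus a trainable linear residual."""

    def __init__(self, base_decoder, in_dim, action_dim):
        # The base decoder is treated as frozen; only the residual would be trained.
        self.base_decoder = base_decoder
        # Zero initialization (an assumption) makes the adapter start out
        # identical to the universal policy.
        self.weights = [[0.0] * in_dim for _ in range(action_dim)]
        self.bias = [0.0] * action_dim

    def act(self, features):
        base = self.base_decoder(features)
        residual = linear(self.weights, self.bias, features)
        return [b + r for b, r in zip(base, residual)]
```

With zero-initialized residual weights, `act` returns exactly the base decoder's action, so adaptation on a hard sequence would begin from the universal policy's behavior and change only the small residual parameters.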