GMT: General Motion Tracking for Humanoid Whole-Body Control

Zixuan Chen, Mazeyu Ji, Xuxin Cheng, Xuanbin Peng, Xue Bin Peng, Xiaolong Wang

TL;DR

GMT tackles the challenge of learning a general, real-world motion-tracking policy for humanoids by coupling Adaptive Sampling with a Motion Mixture-of-Experts. The framework uses a two-stage teacher-student training pipeline, curated mocap data, and refined motion inputs to achieve high-fidelity tracking across diverse skills, with strong sim-to-real transfer aided by domain randomization. Real-world deployment on a medium-sized humanoid demonstrates state-of-the-art performance and broad applicability, including compatibility with MDM-generated motions. Collectively, GMT lays a foundation for scalable, general-purpose whole-body control in humanoid robotics.

Abstract

Tracking general whole-body motions in the real world is a useful route toward general-purpose humanoid robots. However, this is challenging because of the temporal and kinematic diversity of the motions, the capability of the policy, and the difficulty of coordinating the upper and lower body. To address these issues, we propose GMT, a general and scalable motion-tracking framework that trains a single unified policy to enable humanoid robots to track diverse motions in the real world. GMT is built upon two core components: an Adaptive Sampling strategy and a Motion Mixture-of-Experts (MoE) architecture. Adaptive Sampling automatically balances easy and difficult motions during training. The MoE ensures better specialization across different regions of the motion manifold. Extensive experiments in both simulation and the real world demonstrate the effectiveness of GMT, which achieves state-of-the-art performance across a broad spectrum of motions using a unified general policy. Videos and additional information can be found at https://gmt-humanoid.github.io.
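
To make the idea of balancing easy and difficult motions concrete, here is a minimal sketch of error-weighted clip sampling. It is an illustration under stated assumptions, not the sampling rule used in GMT: the weighting (`max(error, floor)`), the moving-average update, and all function names and constants are hypothetical.

```python
import random

def sampling_probabilities(errors, floor=0.05):
    """Turn per-clip tracking errors into sampling probabilities.

    Assumed rule (not from the paper): weight = max(error, floor),
    normalized to sum to 1. The floor keeps well-tracked clips in
    the training mix so they are not forgotten.
    """
    weights = [max(e, floor) for e in errors]
    total = sum(weights)
    return [w / total for w in weights]

def update_error(old_error, new_error, momentum=0.9):
    """Exponential moving average of one clip's tracking error (assumed)."""
    return momentum * old_error + (1.0 - momentum) * new_error

def sample_clip(errors, rng):
    """Draw one clip index, favoring clips that are tracked poorly."""
    probs = sampling_probabilities(errors)
    return rng.choices(range(len(errors)), weights=probs, k=1)[0]

# Toy example: two easy clips and one hard clip.
errors = [0.1, 0.1, 0.8]
probs = sampling_probabilities(errors)

rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(1000):
    counts[sample_clip(errors, rng)] += 1
```

In this toy setup the hard clip (index 2) is drawn far more often than either easy clip, while the floor guarantees that a clip with zero error still has a nonzero chance of being sampled.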

Paper Structure

This paper contains 27 sections, 2 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: We deploy the general unified motion tracking policy on a medium-sized humanoid robot. GMT can perform a wide range of motion skills with good stability and generalizability, including (a) stretching, (b) kicking-ball, (c) dancing, (d) high kicking, (e) kungfu, and (f) other dynamic skills such as boxing, running, side stepping, and squatting.
  • Figure 2: Distribution of motion categories in the AMASS dataset. The figure shows the proportion of the total motion duration corresponding to each category.
  • Figure 3: An overview of GMT. Here ${\bm{g}}_t$ denotes the motion target frame, ${\bm{o}}_t$ denotes proprioceptive observation, and ${\bm{e}}_t$ denotes privileged information.
  • Figure 4: Output of the gating network over time on a motion clip composed of a sequence of skills.
  • Figure 5: Top percentile tracking errors on the whole AMASS dataset.
  • ...and 2 more figures
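
The gating output referred to in Figure 4 can be illustrated with a minimal soft mixture-of-experts forward pass. This is a sketch of the generic technique, assuming a softmax gate and a probability-weighted sum of expert outputs. The linear experts, the weights, and the function names are hypothetical and chosen for brevity; this section does not specify the GMT architecture.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, bias, x):
    """Affine map: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def moe_forward(x, gate, experts):
    """Soft mixture-of-experts forward pass (assumed form).

    gate:    (W, b) producing one logit per expert
    experts: list of (W, b), each mapping x to an action vector
    Returns the blended action and the gate probabilities.
    """
    gate_probs = softmax(linear(gate[0], gate[1], x))
    expert_outs = [linear(W, b, x) for W, b in experts]
    action_dim = len(expert_outs[0])
    action = [sum(p * out[j] for p, out in zip(gate_probs, expert_outs))
              for j in range(action_dim)]
    return action, gate_probs

# Toy example: two experts, 2-D input, 1-D action.
gate = ([[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0])
experts = [
    ([[1.0, 0.0]], [0.0]),  # expert 0 reads the first input feature
    ([[0.0, 1.0]], [0.0]),  # expert 1 reads the second input feature
]
action, gate_probs = moe_forward([1.0, 0.0], gate, experts)
```

Recording `gate_probs` at every control step of a multi-skill clip would give the kind of per-expert activation trace over time that the Figure 4 caption describes.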