PhyGile: Physics-Prefix Guided Motion Generation for Agile General Humanoid Motion Tracking

Jiacheng Bao, Haoran Yang, Yucheng Xin, Junhong Liu, Yuecheng Xu, Han Liang, Pengfei Han, Xiaoguang Ma, Dong Wang, Bin Zhao

Abstract

Humanoid robots are expected to execute agile and expressive whole-body motions in real-world settings. Existing text-to-motion generation models are predominantly trained on captured human motion datasets, whose priors assume human biomechanics, actuation, mass distribution, and contact strategies. When such motions are directly retargeted to humanoid robots, the resulting trajectories may satisfy geometric constraints (e.g., joint limits and pose continuity) and appear kinematically reasonable. However, they frequently violate the physical feasibility required for real-world execution. To address these issues, we present PhyGile, a unified framework that closes the loop between robot-native motion generation and General Motion Tracking (GMT). PhyGile performs physics-prefix-guided robot-native motion generation at inference time, directly generating robot-native motions in a 262-dimensional skeletal space with physics-guided prefixes, thereby eliminating inference-time retargeting artifacts and reducing generation-execution discrepancies. Before physics-prefix adaptation, we train the GMT controller with a curriculum-based mixture-of-experts scheme, followed by post-training on unlabeled motion data to improve robustness over large-scale robot motions. During physics-prefix adaptation, the GMT controller is further fine-tuned with generated objectives under physics-derived prefixes, enabling agile and stable execution of complex motions on real robots. Extensive offline and real-robot experiments demonstrate that PhyGile expands the frontier of text-driven humanoid control, enabling stable tracking of agile, highly difficult whole-body motions that go well beyond walking and low-dynamic motions typically achieved by prior methods.

Paper Structure (40 sections, 19 equations, 4 figures, 13 tables)

Figures (4)

  • Figure 1: PhyGile translates natural language commands into agile and expressive whole-body motions on humanoid robots, thereby enabling stable real-world execution of highly difficult motions. Project Page: https://baojch.github.io/phygile-page/
  • Figure 2: Overview of PhyGile. (Left) GMT: A two-stage MoE tracker is first trained with curriculum-constrained routing to induce expert specialization, followed by global soft post-training with dynamic expert expansion to absorb persistently difficult motions. (Right) Generation of Diffusion Policy: A TP-MoE–conditioned robot-native diffusion model generates 262D robot motion sequences from text. (Center) Motion Generation Fine-tuning: Executable motion prefixes are concatenated with newly generated 1-second continuations and validated by the pretrained GMT. Closed-loop simulation refinement further enforces dynamic feasibility and improves consistency between generated and trackable motions, and the fine-tuned GMT policy is deployed on real robots.
  • Figure 3: Qualitative results on real robots demonstrating agile, whole-body motion generation across diverse behaviors.
  • Figure 4: Ablation on key design choices. (a) Generation module: varying the top-k selected experts improves performance up to k=6 (peak R@3), with a slight drop at k=8. (b) GMT module: increasing module size consistently raises the success rate; the full PhyGile outperforms PhyGile-C and PhyGile-CFM across all sizes.
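
The captions for Figures 2 and 4 refer to mixture-of-experts routing with a top-k selection of experts. As a rough illustration of what top-k selection means, the sketch below implements generic top-k softmax gating in plain Python. It is not the authors' TP-MoE router: the function names `top_k_gate` and `moe_output`, the example logits, and the choice to renormalize weights over the selected experts are all assumptions made for illustration.

```python
import math


def top_k_gate(logits, k):
    """Return {expert_index: weight} for the k largest router logits.

    Weights are a softmax over the selected logits only, so they sum to 1.
    (Assumption: renormalizing over the selected experts; the paper's
    router may differ.)
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits; ties go to the lower index.
    top = sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]
    # Numerically stable softmax restricted to the selected experts.
    peak = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_output(expert_outputs, gate):
    """Weighted sum of the selected experts' output vectors."""
    dim = len(expert_outputs[0])
    out = [0.0] * dim
    for i, w in gate.items():
        for d in range(dim):
            out[d] += w * expert_outputs[i][d]
    return out


if __name__ == "__main__":
    # Hypothetical router logits for 8 experts; keep the 6 strongest.
    logits = [0.3, 1.2, -0.5, 0.9, 2.0, -1.1, 0.0, 0.7]
    gate = top_k_gate(logits, k=6)
    print(sorted(gate))  # indices of the selected experts
```

In this sketch a small k keeps only the strongest experts, while k equal to the number of experts reduces to an ordinary softmax mixture; the ablation in Figure 4(a) varies this k.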