ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation

Xialin He, Sirui Xu, Xinyao Li, Runpei Dong, Liuyu Bian, Yu-Xiong Wang, Liang-Yan Gui

TL;DR

Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception, consistently outperforming tracking-only baselines with limited skills.

Abstract

Achieving autonomous and versatile whole-body loco-manipulation remains a central barrier to making humanoids practically useful. Yet existing approaches are fundamentally constrained: retargeted data are often scarce or low-quality; methods struggle to scale to large skill repertoires; and, most importantly, they rely on tracking predefined motion references rather than generating behavior from perception and high-level task specifications. To address these limitations, we propose ULTRA, a unified framework with two key components. First, we introduce a physics-driven neural retargeting algorithm that translates large-scale motion capture to humanoid embodiments while preserving physical plausibility for contact-rich interactions. Second, we learn a unified multimodal controller that supports both dense references and sparse task specifications, under sensing ranging from accurate motion-capture state to noisy egocentric visual inputs. We distill a universal tracking policy into this controller, compress motor skills into a compact latent space, and apply reinforcement learning finetuning to expand coverage and improve robustness under out-of-distribution scenarios. This enables coordinated whole-body behavior from sparse intent without test-time reference motions. We evaluate ULTRA in simulation and on a real Unitree G1 humanoid. Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception, consistently outperforming tracking-only baselines with limited skills.

Paper Structure (23 sections, 3 equations, 7 figures, 17 tables)

Figures (7)

  • Figure 2: ULTRA follows four stages: (i) Neural Retargeting: an RL policy converts MoCap data into physically feasible G1 rollouts with augmentation; (ii) Tracking: a privileged teacher tracks these rollouts using full state and references; (iii) Distillation: we distill the teacher into a multimodal student for realistic sensing and sparse goals, with additional RL finetuning; (iv) Deployment: the student runs under real sensing, supporting depth input or MoCap-based state estimation.
  • Figure 3: Qualitative comparison of our retargeting and OmniRetarget at the same frame/sequence. Top: final frame; the baseline shows undesired standing foot placement. Bottom: a contact frame; ours yields more stable contacts.
  • Figure 4: Zero-shot augmentation with the retargeting policy. Left: trajectory scaling. Right: object scaling. Motions remain plausible, enabling scalable data augmentation.
  • Figure 5: Left: skill latents under different modalities; aside from tracking, the embeddings largely mix, indicating a shared skill space. Right: skill latents cluster by text label (C0–C4), showing semantic structure.
  • Figure 6: Sim-to-sim comparison for egocentric goal following. Blue/green: point cloud observation without/with noise; yellow: object goal. Top: without RL finetuning. Bottom: with RL finetuning.
  • ...and 2 more figures