ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation

Xialin He; Sirui Xu; Xinyao Li; Runpei Dong; Liuyu Bian; Yu-Xiong Wang; Liang-Yan Gui

TL;DR

Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception, consistently outperforming tracking-only baselines with limited skills.

Abstract

Achieving autonomous and versatile whole-body loco-manipulation remains a central barrier to making humanoids practically useful. Yet existing approaches are fundamentally constrained: retargeted data are often scarce or low-quality; methods struggle to scale to large skill repertoires; and, most importantly, they rely on tracking predefined motion references rather than generating behavior from perception and high-level task specifications. To address these limitations, we propose ULTRA, a unified framework with two key components. First, we introduce a physics-driven neural retargeting algorithm that translates large-scale motion capture to humanoid embodiments while preserving physical plausibility for contact-rich interactions. Second, we learn a unified multimodal controller that supports both dense references and sparse task specifications, under sensing ranging from accurate motion-capture state to noisy egocentric visual inputs. We distill a universal tracking policy into this controller, compress motor skills into a compact latent space, and apply reinforcement learning finetuning to expand coverage and improve robustness under out-of-distribution scenarios. This enables coordinated whole-body behavior from sparse intent without test-time reference motions. We evaluate ULTRA in simulation and on a real Unitree G1 humanoid. Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception, consistently outperforming tracking-only baselines with limited skills.
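The distillation step described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. It assumes linear teacher and student policies, a hypothetical sparse goal taken as the first two state dimensions, and random masking of the dense reference during training. It omits the skill latent space and the RL finetuning. All names (`make_batch`, `student_input`, `distill`) are made up for illustration.

```python
import numpy as np

def make_batch(rng, n, d_state=6, d_goal=2):
    """Toy data: a full 'privileged' state and a sparse goal derived from it."""
    state = rng.normal(size=(n, d_state))
    goal = state[:, :d_goal]  # assumption: the goal is a slice of the state
    return state, goal

def student_input(state, goal, use_dense):
    """Build the student observation; zero the dense reference when masked."""
    flag = use_dense.astype(float)[:, None]
    dense = state * flag  # dense reference is hidden when flag == 0
    return np.concatenate([dense, goal, flag], axis=1)

def distill(steps=500, lr=0.05, seed=0):
    """Regress a linear student onto a fixed linear teacher's actions."""
    rng = np.random.default_rng(seed)
    d_state, d_goal, d_act = 6, 2, 3
    W_teacher = rng.normal(size=(d_state, d_act))      # stand-in teacher
    W_student = np.zeros((d_state + d_goal + 1, d_act))
    losses = []
    for _ in range(steps):
        state, goal = make_batch(rng, 64, d_state, d_goal)
        use_dense = rng.random(64) < 0.5               # random modality mask
        x = student_input(state, goal, use_dense)
        target = state @ W_teacher                     # teacher sees full state
        err = x @ W_student - target
        losses.append(float(np.mean(err ** 2)))
        W_student -= lr * x.T @ err / len(x)           # gradient step on MSE
    return W_student, losses

if __name__ == "__main__":
    _, losses = distill()
    print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

In this toy setting the loss does not reach zero: when the dense reference is masked, the student sees only the goal and cannot recover the remaining state dimensions. That gap is one motivation for adding more than plain imitation, which the abstract addresses with a latent skill space and RL finetuning.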

Paper Structure (23 sections, 3 equations, 7 figures, 17 tables)

Introduction
Related Work
Motion Retargeting
Humanoid Whole-body Locomotion
Humanoid Whole-body Loco-Manipulation
Problem Formulation and Preliminaries
Task Interface
Preliminaries
Method
General Motion Tracking for Neural Retargeting
Dense Motion Tracking for Teacher Policy
Multimodal Student Policy
Experimental Results
Experimental Setup
Motion Retargeting
...and 8 more sections

Figures (7)

Figure 2: ULTRA follows four stages: (i) Neural Retargeting: an RL policy converts MoCap data into physically feasible G1 rollouts with augmentation; (ii) Tracking: a privileged teacher tracks these rollouts using full state and references; (iii) Distillation: we distill the teacher into a multimodal student for realistic sensing and sparse goals, with additional RL finetuning; (iv) Deployment: the student runs under real sensing, supporting depth input or MoCap-based state estimation.
Figure 3: Qualitative comparison of our retargeting and OmniRetarget at the same frame/sequence. Top: final frame; the baseline shows undesired standing foot placement. Bottom: a contact frame; ours yields more stable contacts.
Figure 4: Zero-shot augmentation with the retargeting policy. Left: trajectory scaling. Right: object scaling. Motions remain plausible, enabling scalable data augmentation.
Figure 5: Left: skill latents under different modalities; aside from tracking, the embeddings largely mix, indicating a shared skill space. Right: skill latents cluster by text label (C0-C4), showing semantic structure.
Figure 6: Sim-to-sim comparison for egocentric goal following. Blue/green: point cloud observation without/with noise; yellow: object goal. Top: without RL finetuning. Bottom: with RL finetuning.
...and 2 more figures
