MOSAIC: Bridging the Sim-to-Real Gap in Generalist Humanoid Motion Tracking and Teleoperation with Rapid Residual Adaptation

Zhenguo Sun, Bo-Sheng Huang, Yibo Peng, Xukun Li, Jingyu Ma, Yu Sun, Zhe Li, Haojun Jiang, Biao Gao, Zhenshan Bing, Xinlong Wang, Alois Knoll

TL;DR

MOSAIC addresses the sim-to-real gap for generalist humanoid motion tracking and teleoperation by uniting a teleoperation-ready general motion tracker trained on a large, heterogeneous motion bank with a lightweight, data-efficient residual adaptor for new interfaces. The learning framework uses PPO with world-frame rewards, a two-policy scheme (GMT and ADAPT), zero-biased residual initialization, and dual-teacher distillation to inject interface-specific corrections without forgetting general abilities, all decoupled via a RobotBridge deployment layer. A multi-source data strategy with adaptive resampling ensures broad motion coverage, while experiments and real-robot tests demonstrate robust offline replay and long-horizon online teleoperation under latency and noise. The results indicate that interface-level adaptation, distilled into a residual module, yields more reliable deployment than continual fine-tuning or periodic augmentation, offering practical pathways for cross-robot and real-world demonstrations with minimal additional data.
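The adaptive resampling idea mentioned above can be illustrated with a small sketch. The summary does not say which signal drives the resampling or how the levels are defined, so everything here is an assumption made for illustration: a hypothetical per-clip failure rate serves as the difficulty score, sampling happens first over data sources and then over clips within the chosen source, and the `floor` parameter keeps easy clips from being starved entirely. The function names and the `bank` layout are likewise hypothetical.

```python
import random


def normalize(scores, floor=0.05):
    """Clip scores from below and normalize them into sampling probabilities."""
    clipped = [max(s, floor) for s in scores]
    total = sum(clipped)
    return [c / total for c in clipped]


def sample_motion(bank, rng):
    """Draw (source, clip) from a bank laid out as {source: {clip: failure_rate}}.

    Assumed scheme, not the paper's: sources are weighted by their mean
    failure rate, then clips within the chosen source by their own rate.
    """
    sources = list(bank)
    source_scores = [sum(bank[s].values()) / len(bank[s]) for s in sources]
    source = rng.choices(sources, weights=normalize(source_scores))[0]
    clips = list(bank[source])
    clip_scores = [bank[source][c] for c in clips]
    clip = rng.choices(clips, weights=normalize(clip_scores))[0]
    return source, clip


# Hypothetical motion bank: failure rates in [0, 1] measured during training.
bank = {
    "mocap": {"walk": 0.02, "jump_turn": 0.60},
    "video": {"squat": 0.10, "kick": 0.40},
}
rng = random.Random(0)
draws = [sample_motion(bank, rng) for _ in range(2000)]
```

Under these assumptions, clips the policy currently fails on are revisited more often, while the floor keeps every clip reachable so that already-mastered motions are not dropped from training.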

Abstract

Generalist humanoid motion trackers have recently achieved strong simulation metrics by scaling data and training, yet often remain brittle on hardware during sustained teleoperation due to interface- and dynamics-induced errors. We present MOSAIC, an open-source, full-stack system for humanoid motion tracking and whole-body teleoperation across multiple interfaces. MOSAIC first learns a teleoperation-oriented general motion tracker via RL on a multi-source motion bank with adaptive resampling and rewards that emphasize world-frame motion consistency, which is critical for mobile teleoperation. To bridge the sim-to-real interface gap without sacrificing generality, MOSAIC then performs rapid residual adaptation: an interface-specific policy is trained using minimal interface-specific data, and then distilled into the general tracker through an additive residual module, outperforming naive fine-tuning or continual learning. We validate MOSAIC with systematic ablations, out-of-distribution benchmarking, and real-robot experiments demonstrating robust offline motion replay and online long-horizon teleoperation under realistic latency and noise.
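As a rough illustration of the additive residual module, the sketch below adds a zero-initialized correction to the output of a frozen base policy, so the combined policy starts out identical to the general tracker and only drifts from it as the residual is trained. This is a minimal sketch under stated assumptions rather than the MOSAIC implementation: both policies are plain linear maps, the class and method names are hypothetical, and distillation is reduced to a squared-error regression onto a single teacher action (the summary describes a dual-teacher scheme, which is not reproduced here).

```python
import random


def linear(weights, bias, x):
    """Compute y = W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class ResidualPolicy:
    """Frozen base policy plus an additive, zero-initialized residual head."""

    def __init__(self, base_w, base_b):
        self.base_w, self.base_b = base_w, base_b  # never updated
        n_act, n_obs = len(base_w), len(base_w[0])
        # Zero initialization: the residual contributes nothing at first.
        self.res_w = [[0.0] * n_obs for _ in range(n_act)]
        self.res_b = [0.0] * n_act

    def base_action(self, obs):
        return linear(self.base_w, self.base_b, obs)

    def act(self, obs):
        delta = linear(self.res_w, self.res_b, obs)
        return [b + d for b, d in zip(self.base_action(obs), delta)]

    def distill_step(self, obs, teacher_action, lr=0.1):
        """One gradient step on 0.5 * ||act(obs) - teacher||^2, residual only."""
        err = [a - t for a, t in zip(self.act(obs), teacher_action)]
        for i, e in enumerate(err):
            for j, xj in enumerate(obs):
                self.res_w[i][j] -= lr * e * xj
            self.res_b[i] -= lr * e


rng = random.Random(0)
base_w = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
policy = ResidualPolicy(base_w, [0.0, 0.0])
obs = [0.5, -0.2, 0.1]

# A hypothetical interface-specific teacher that shifts the base action.
teacher = [a + c for a, c in zip(policy.base_action(obs), [0.1, -0.1])]
for _ in range(100):
    policy.distill_step(obs, teacher)
```

Because only the residual parameters change, the base tracker's weights stay exactly as trained, which is one simple way to add interface-specific corrections without overwriting general behavior.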

Paper Structure (77 sections, 17 equations, 11 figures, 16 tables)

Figures (11)

  • Figure 1: MOSAIC in Action. MOSAIC enables a single humanoid policy to operate in two modes: offline motion replay (top) and online whole-body teleoperation from multiple wearable interfaces (bottom). In offline replay, the robot robustly tracks diverse and highly dynamic reference motions—walking, running, kicking, kungfu-style strikes, jumping, and squatting. In online teleoperation, MOSAIC faithfully mirrors real-time human motion streams and supports challenging contact-rich and high-agility behaviors, including mid-air jump turns, single-leg support, and jump-shot–style movements.
  • Figure 2: MOSAIC System Overview. MOSAIC consists of a unified training–deployment pipeline for humanoid motion tracking and teleoperation. Training/Simulation combines heterogeneous multi-source motions, two-level adaptive resampling, and policy training to yield a deployable policy that preserves generality while improving real-robot robustness. Deployment/Real Robot supports both offline motion replay and online teleoperation. Finally, RobotBridge provides a modular interface that enables consistent evaluation and portable deployment across platforms.
  • Figure 3: Quantitative Comparison and Ablation Studies of General Motion Tracking. The radar charts illustrate five core metrics characterizing tracking fidelity, while the bar charts depict robustness in terms of Success Rate and Average Steps per Episode. The multi-source panel (fig:compare_multi_source) compares multi-source versus single-source data distributions. The overall comparison panel (fig:compare_all) evaluates our proposed variants (Pure RL + world frame reward, Pure RL + robot frame reward, and DAgger + world frame reward) against the baselines GMT [chen2025gmt] and TWIST [ze2025twist].
  • Figure 4: Qualitative Comparison on High Dynamic Motion. From left to right are our model, TWIST, and GMT. Our model achieves substantial ground clearance at the reference apex, whereas baselines (i.e., TWIST and GMT) struggle to capture high-acceleration explosive movements.
  • Figure S1: Randomly Sampled Motions From the Motion Dataset. Each tile visualizes a motion frame extracted at random time indices across different motions, illustrating the diversity of the motion dataset.
  • ...and 6 more figures