MOSAIC: Bridging the Sim-to-Real Gap in Generalist Humanoid Motion Tracking and Teleoperation with Rapid Residual Adaptation
Zhenguo Sun, Bo-Sheng Huang, Yibo Peng, Xukun Li, Jingyu Ma, Yu Sun, Zhe Li, Haojun Jiang, Biao Gao, Zhenshan Bing, Xinlong Wang, Alois Knoll
TL;DR
MOSAIC addresses the sim-to-real gap in generalist humanoid motion tracking and teleoperation by combining a teleoperation-ready general motion tracker, trained on a large, heterogeneous motion bank, with a lightweight, data-efficient residual adaptor for new interfaces. The learning framework uses PPO with world-frame rewards, a two-policy scheme (GMT and ADAPT), zero-biased residual initialization, and dual-teacher distillation to inject interface-specific corrections without forgetting general abilities, all decoupled via a RobotBridge deployment layer. A multi-source data strategy with adaptive resampling ensures broad motion coverage, and simulation experiments and real-robot tests demonstrate robust offline replay and long-horizon online teleoperation under latency and noise. The results indicate that interface-level adaptation, distilled into a residual module, deploys more reliably than continual fine-tuning or periodic augmentation, offering a practical path to cross-robot and real-world demonstrations with minimal additional data.
Abstract
Generalist humanoid motion trackers have recently achieved strong simulation metrics by scaling data and training, yet they often remain brittle on hardware during sustained teleoperation because of interface- and dynamics-induced errors. We present MOSAIC, an open-source, full-stack system for humanoid motion tracking and whole-body teleoperation across multiple interfaces. MOSAIC first learns a teleoperation-oriented general motion tracker via RL on a multi-source motion bank with adaptive resampling and rewards that emphasize world-frame motion consistency, which is critical for mobile teleoperation. To bridge the sim-to-real interface gap without sacrificing generality, MOSAIC then performs rapid residual adaptation: an interface-specific policy is trained on minimal interface-specific data and then distilled into the general tracker through an additive residual module, an approach that outperforms naive fine-tuning and continual learning. We validate MOSAIC with systematic ablations, out-of-distribution benchmarking, and real-robot experiments demonstrating robust offline motion replay and online long-horizon teleoperation under realistic latency and noise.
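To make the idea of an additive residual module concrete, the following is a minimal sketch, not the MOSAIC implementation. It assumes one common reading of a residual adaptor: the adapted action is the general tracker's action plus a correction, and the correction network's output layer starts at zero so that the combined policy initially reproduces the general tracker exactly. The class `ResidualPolicy`, the helper functions, the layer sizes, and the linear stand-in for the general tracker are all hypothetical and chosen only for illustration.

```python
import math
import random


def linear(weights, bias, x):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def make_layer(n_out, n_in, rng, zero=False):
    """Return (weights, bias). With zero=True every parameter is 0.0."""
    if zero:
        weights = [[0.0] * n_in for _ in range(n_out)]
    else:
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
    return weights, [0.0] * n_out


class ResidualPolicy:
    """Hypothetical wrapper: action = base(obs) + residual(obs)."""

    def __init__(self, obs_dim, act_dim, hidden_dim=16, seed=0):
        rng = random.Random(seed)
        # Stand-in for the general tracker (assumption: a single linear map).
        self.base = make_layer(act_dim, obs_dim, rng)
        # Residual adaptor: random hidden layer, zero-initialized output layer.
        self.res_hidden = make_layer(hidden_dim, obs_dim, rng)
        self.res_out = make_layer(act_dim, hidden_dim, rng, zero=True)

    def base_action(self, obs):
        return linear(*self.base, obs)

    def residual(self, obs):
        hidden = [math.tanh(h) for h in linear(*self.res_hidden, obs)]
        return linear(*self.res_out, hidden)

    def act(self, obs):
        # Additive combination of the general action and the correction.
        return [a + r for a, r in zip(self.base_action(obs), self.residual(obs))]


policy = ResidualPolicy(obs_dim=4, act_dim=2)
obs = [0.1, -0.2, 0.3, 0.4]
print(policy.act(obs) == policy.base_action(obs))
```

Under these assumptions the residual contributes nothing before adaptation, so any later change in behavior comes only from training the correction term while the base parameters stay untouched.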
