SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

Zhengyi Luo; Ye Yuan; Tingwu Wang; Chenran Li; Sirui Chen; Fernando Castañeda; Zi-Ang Cao; Jiefeng Li; David Minor; Qingwei Ben; Xingye Da; Runyu Ding; Cyrus Hogg; Lina Song; Edy Lim; Eugene Jeong; Tairan He; Haoru Xue; Wenli Xiao; Zi Wang; Simon Yuen; Jan Kautz; Yan Chang; Umar Iqbal; Linxi "Jim" Fan; Yuke Zhu

TL;DR

SONIC presents a scalable humanoid control framework that treats motion tracking as a foundation for general, real-time whole-body control. By scaling data (100M frames), model capacity (up to 42M parameters), and compute (128 GPUs), it learns a universal motion-tracking policy with cross-embodiment encoders and a universal token space. It couples this tracker with a generative, latent-space kinematic planner and multi-modal GENMO-based priors, enabling VR teleoperation, video/text/music control, and VLA-driven autonomous tasks, all via a single policy. Real-world experiments show strong generalization to unseen motions, robust sim-to-real transfer, and high success in VR and mobile manipulation tasks, suggesting motion tracking can serve as a practical foundation for general-purpose humanoid autonomy.

Abstract

Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs over several days. We show that scaling up model capacity, data, and compute yields a generalist humanoid controller capable of creating natural and robust whole-body movements. Specifically, we posit motion tracking as a natural and scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (from 1.2M to 42M parameters), dataset volume (over 100M frames, 700 hours of high-quality motion data), and compute (9k GPU hours). Beyond demonstrating the benefits of scale, we show the practical utility of our model through two mechanisms: (1) a real-time universal kinematic planner that bridges motion tracking to downstream task execution, enabling natural and interactive control, and (2) a unified token space that supports various motion input interfaces, such as VR teleoperation devices, human videos, and vision-language-action (VLA) models, all using the same policy. Scaling motion tracking exhibits favorable properties: performance improves steadily with increased compute and data diversity, and learned representations generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.
