SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li, Sirui Chen, Fernando Castañeda, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, Xingye Da, Runyu Ding, Cyrus Hogg, Lina Song, Edy Lim, Eugene Jeong, Tairan He, Haoru Xue, Wenli Xiao, Zi Wang, Simon Yuen, Jan Kautz, Yan Chang, Umar Iqbal, Linxi "Jim" Fan, Yuke Zhu

TL;DR

SONIC presents a scalable humanoid control framework that treats motion tracking as a foundation for general, real-time whole-body control. By scaling data (100M frames), model capacity (up to 42M parameters), and compute (128 GPUs), it learns a universal motion-tracking policy with cross-embodiment encoders and a universal token space. It couples this tracker with a generative, latent-space kinematic planner and multi-modal GENMO-based priors, enabling VR teleoperation, video/text/music control, and VLA-driven autonomous tasks, all via a single policy. Real-world experiments show strong generalization to unseen motions, robust sim-to-real transfer, and high success in VR and mobile manipulation tasks, suggesting motion tracking can serve as a practical foundation for general-purpose humanoid autonomy.

Abstract

Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs over several days. We show that scaling up model capacity, data, and compute yields a generalist humanoid controller capable of creating natural and robust whole-body movements. Specifically, we posit motion tracking as a natural and scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (from 1.2M to 42M parameters), dataset volume (over 100M frames, 700 hours of high-quality motion data), and compute (9k GPU hours). Beyond demonstrating the benefits of scale, we show the practical utility of our model through two mechanisms: (1) a real-time universal kinematic planner that bridges motion tracking to downstream task execution, enabling natural and interactive control, and (2) a unified token space that supports various motion input interfaces, such as VR teleoperation devices, human videos, and vision-language-action (VLA) models, all using the same policy. Scaling motion tracking exhibits favorable properties: performance improves steadily with increased compute and data diversity, and learned representations generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.

Paper Structure

This paper contains 51 sections, 4 equations, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: SONIC enables diverse humanoid tasks through a universal control policy that handles diverse input modalities and control interfaces.
  • Figure 2: (a-c) Effect of scaling dataset size, model size, and compute. Mean per-joint position error (MPJPE) measures motion imitation error; lower is better. In (a), dataset size is measured in millions of frames. (d-g) Comparison of our method with baselines on tracking out-of-distribution motion sequences. (d) Tracking success rate. (e-g) Tracking accuracy metrics, evaluated only on trajectories that are successfully tracked.
  • Figure 3: Top three rows: interactive navigation switching between different velocities, directions, and styles. Bottom two rows: SONIC produces high-quality and responsive boxing motions while preserving the robot's complete freedom of movement throughout the task.
  • Figure 4: Interactive squatting, kneeling, and crawling. With SONIC, the robot can squat, kneel, and crawl at arbitrary heights, enabling seamless application in real-world downstream scenarios such as teleoperation and navigation in complex environments.
  • Figure 5: Video teleoperation, multi-modal control, and VR whole-body teleoperation.
  • ...and 3 more figures
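
The Figure 2 caption reports tracking quality as MPJPE. As a rough illustration of what that metric measures, here is a minimal sketch in plain Python. It assumes each motion is a list of frames, each frame a list of (x, y, z) joint positions in meters, and that predicted and reference joints are compared in the same global coordinate frame. The function name `mpjpe` and this data layout are assumptions for illustration; the paper's exact evaluation protocol (units, root alignment, joint set) may differ.

```python
import math
from typing import Sequence

Joint = Sequence[float]   # (x, y, z) position of one joint, assumed in meters
Frame = Sequence[Joint]   # all joints at one time step


def mpjpe(predicted: Sequence[Frame], reference: Sequence[Frame]) -> float:
    """Mean Euclidean distance between matching joints, averaged over
    every joint of every frame. Lower means closer tracking."""
    if len(predicted) != len(reference):
        raise ValueError("motions must have the same number of frames")

    total_distance = 0.0
    joint_count = 0
    for pred_frame, ref_frame in zip(predicted, reference):
        if len(pred_frame) != len(ref_frame):
            raise ValueError("frames must have the same number of joints")
        for pred_joint, ref_joint in zip(pred_frame, ref_frame):
            # math.dist returns the Euclidean distance between two points
            total_distance += math.dist(pred_joint, ref_joint)
            joint_count += 1

    if joint_count == 0:
        raise ValueError("motions must contain at least one joint")
    return total_distance / joint_count


# Hypothetical toy data: two frames, two joints per frame.
reference_motion = [
    [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
    [(1.0, 0.0, 0.0), (1.0, 0.0, 1.0)],
]
# The tracked motion matches the reference except for one joint,
# which is displaced by 3 cm along y and 4 cm along z (5 cm in total).
tracked_motion = [
    [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
    [(1.0, 0.03, 0.04), (1.0, 0.0, 1.0)],
]

error_m = mpjpe(tracked_motion, reference_motion)
print(f"MPJPE: {error_m * 1000:.1f} mm")
```

In this toy case one of the four joints is off by 5 cm and the rest match exactly, so the average error is about 12.5 mm. Because the caption notes that accuracy metrics are evaluated only on successfully tracked trajectories, a fuller evaluation would filter out failed rollouts before averaging; that filtering step is not shown here.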