EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data

Ruijie Zheng, Dantong Niu, Yuqi Xie, Jing Wang, Mengda Xu, Yunfan Jiang, Fernando Castañeda, Fengyuan Hu, You Liang Tan, Letian Fu, Trevor Darrell, Furong Huang, Yuke Zhu, Danfei Xu, Linxi Fan

TL;DR

This work investigates whether large-scale egocentric human data can drive high-DoF dexterous robot manipulation. It introduces EgoScale, a two-stage transfer framework that pretrains a flow-based Vision-Language-Action policy on $20{,}854$ hours of human data and then aligns it with robot sensing through a lightweight mid-training phase, enabling effective post-training adaptation. A log-linear relationship $L = 0.024 - 0.003 \ldots$

Abstract

Human behavior is among the most scalable sources of data for learning physical intelligence, yet how to effectively leverage it for dexterous manipulation remains unclear. While prior work demonstrates human-to-robot transfer in constrained settings, it is unclear whether large-scale human data can support fine-grained, high-degree-of-freedom dexterous manipulation. We present EgoScale, a human-to-dexterous-manipulation transfer framework built on large-scale egocentric human data. We train a Vision-Language-Action (VLA) model on over 20,854 hours of action-labeled egocentric human video, more than 20 times larger than prior efforts, and uncover a log-linear scaling law between human data scale and validation loss. This validation loss strongly correlates with downstream real-robot performance, establishing large-scale human data as a predictable supervision source. Beyond scale, we introduce a simple two-stage transfer recipe: large-scale human pretraining followed by lightweight aligned human-robot mid-training. This enables strong long-horizon dexterous manipulation and one-shot task adaptation with minimal robot supervision. Our final policy improves average success rate by 54% over a no-pretraining baseline using a 22-DoF dexterous robotic hand, and transfers effectively to robots with lower-DoF hands, indicating that large-scale human motion provides a reusable, embodiment-agnostic motor prior.
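
To make the notion of a log-linear scaling law concrete, the following is a minimal sketch of how such a relationship could be fitted and checked. It assumes the form loss = a - b * ln(D), with D measured in hours and a natural logarithm; the abstract does not state the log base or the units, so both are assumptions here. The function name `fit_log_linear` and the data points are hypothetical: the points are synthetic and generated inside the script, and they are not the paper's measurements.

```python
import math


def fit_log_linear(data_hours, losses):
    """Ordinary least-squares fit of loss = a - b * ln(D).

    Returns (a, b, r_squared). The natural log and the use of hours
    as the unit of D are assumptions made for this illustration.
    """
    xs = [math.log(d) for d in data_hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n

    # Closed-form simple linear regression of loss on ln(D).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Coefficient of determination of the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, losses))
    ss_tot = sum((y - mean_y) ** 2 for y in losses)
    r_squared = 1.0 - ss_res / ss_tot

    # The law is written with a minus sign, so b is the negated slope.
    return intercept, -slope, r_squared


# Synthetic, illustrative points only (NOT results from the paper).
hours = [1000, 2000, 5000, 10000, 20000]
true_a, true_b = 0.05, 0.004  # hypothetical coefficients
losses = [true_a - true_b * math.log(h) for h in hours]

a, b, r2 = fit_log_linear(hours, losses)
print(f"a={a:.4f}, b={b:.4f}, R^2={r2:.4f}")
```

Because the fit is linear in ln(D), each doubling of the data lowers the predicted loss by the same fixed amount, b * ln(2), which is what makes this kind of relationship useful for extrapolating to larger datasets.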

Paper Structure (31 sections, 1 equation, 10 figures)

This paper contains 31 sections, 1 equation, 10 figures.

Figures (10)

  • Figure 1: EgoScale: Two-stage human-to-robot learning framework. A flow-based Vision-Language-Action (VLA) policy is first pretrained on 20,854 hours of egocentric human videos using wrist motion and retargeted dexterous hand actions. A lightweight mid-training stage with aligned human-robot play data (pairs highlighted with green and gray boundaries) adapts the representation to robot sensing and control. The resulting policy is post-trained on downstream tasks, enabling efficient learning of dexterous manipulation and one-shot generalization to unseen skills.
  • Figure 2: Human Data Collection and Model Architecture. (Left) Aligned human-robot mid-training data are collected using the same sensing setup as the robot. Vive trackers and Manus gloves capture arm and hand motion, while one head-mounted camera and two wrist-mounted cameras record egocentric and wrist views, enabling consistent perception–action alignment. (Right) A flow-based VLA policy with a VLM backbone and DiT action expert. Human and robot data are unified through a wrist-level action representation, with lightweight embodiment-specific adapters for proprioception and hand actions.
  • Figure 3: Post-Training Evaluation Tasks. Five dexterous manipulation tasks used to evaluate post-training performance.
  • Figure 4: Main Experimental Results. Comparison of Human Pre-train + Mid-Training, Human Pretraining, and No Pretraining across five dexterous manipulation tasks under two evaluation metrics.
  • Figure 5: Scaling behavior of human pretraining. Left: Human validation loss versus training steps for models pretrained with increasing amounts of egocentric human data (1k–20k hours). Larger datasets yield stable, monotonic improvements, while smaller datasets exhibit early overfitting. Center: Optimal validation loss at convergence as a function of human data scale, revealing a near-perfect log-linear scaling law ($R^2=0.9983$). Right: Downstream robot performance after post-training, measured by average task completion score, improves consistently with increased human data scale. Together, these results demonstrate predictable scaling of learned action representations and their direct translation to improved dexterous manipulation performance.
  • ...and 5 more figures