EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos

Ruihan Yang; Qinxi Yu; Yecheng Wu; Rui Yan; Borui Li; An-Chieh Cheng; Xueyan Zou; Yunhao Fang; Xuxin Cheng; Ri-Zhao Qiu; Hongxu Yin; Sifei Liu; Song Han; Yao Lu; Xiaolong Wang

TL;DR

EgoVLA is a vision-language-action framework trained on large-scale egocentric human videos to learn dexterous manipulation and transfer the resulting policies to a bimanual humanoid robot. By unifying human and robot hand representations around MANO and applying IK-based retargeting, EgoVLA achieves transferable control with only a modest number of robot demonstrations. The work also introduces the Ego Humanoid Manipulation Benchmark in NVIDIA IsaacSim, enabling reproducible evaluation across short- and long-horizon tasks under varied visual conditions. Empirical results show that human-video pretraining substantially improves in-domain and out-of-domain performance, while ablations show that robot demonstrations remain necessary for reliable deployment.

Abstract

Real robot data collection for imitation learning has led to significant advancements in robotic manipulation. However, the requirement for robot hardware in the process fundamentally constrains the scale of the data. In this paper, we explore training Vision-Language-Action (VLA) models using egocentric human videos. The benefit of human videos lies not only in their scale but, more importantly, in the richness of their scenes and tasks. With a VLA trained on human video that predicts human wrist and hand actions, we can perform inverse kinematics (IK) and retargeting to convert the human actions to robot actions. We fine-tune the model using a few robot manipulation demonstrations to obtain the robot policy, namely EgoVLA. We propose a simulation benchmark called the Ego Humanoid Manipulation Benchmark, where we design diverse bimanual manipulation tasks with demonstrations. We fine-tune and evaluate EgoVLA on the Ego Humanoid Manipulation Benchmark, show significant improvements over baselines, and ablate the importance of human data. Videos can be found on our website: https://rchalyang.github.io/EgoVLA
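
To make the inverse kinematics step mentioned in the abstract concrete, here is a minimal sketch of IK on a toy planar two-link arm: given a target position for the end of the arm (standing in for a predicted wrist position), it recovers joint angles that reach it. This is not the authors' implementation. The arm geometry, the link lengths `L1` and `L2`, and the function names are all hypothetical and chosen only for illustration; a humanoid arm has more joints and is usually solved numerically.

```python
import math

# Hypothetical link lengths (metres) for a toy planar 2-link arm; not from the paper.
L1, L2 = 0.30, 0.25


def forward_kinematics(q1, q2):
    """Return the (x, y) end-effector position for joint angles q1, q2 (radians)."""
    x = L1 * math.cos(q1) + L2 * math.cos(q1 + q2)
    y = L1 * math.sin(q1) + L2 * math.sin(q1 + q2)
    return x, y


def inverse_kinematics(x, y):
    """Closed-form IK for the toy arm, choosing the non-negative elbow angle.

    Raises ValueError if the target lies outside the reachable workspace.
    """
    r2 = x * x + y * y
    # Law of cosines gives the elbow angle from the squared target distance.
    cos_q2 = (r2 - L1 * L1 - L2 * L2) / (2.0 * L1 * L2)
    if not -1.0 <= cos_q2 <= 1.0:
        raise ValueError("target out of reach")
    q2 = math.acos(cos_q2)
    # Shoulder angle: direction to the target minus the offset caused by the elbow bend.
    q1 = math.atan2(y, x) - math.atan2(L2 * math.sin(q2), L1 + L2 * math.cos(q2))
    return q1, q2


if __name__ == "__main__":
    target = (0.35, 0.20)  # hypothetical wrist target in the arm's base frame
    q1, q2 = inverse_kinematics(*target)
    print("joint angles (rad):", q1, q2)
    print("reached position:", forward_kinematics(q1, q2))
```

Running forward kinematics on the recovered angles should reproduce the target, which is a convenient sanity check for any IK routine. The sketch covers only the wrist-position part of the conversion; mapping human hand configurations to robot hand joints (retargeting) is a separate step that it does not attempt.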
