EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos

Ruihan Yang, Qinxi Yu, Yecheng Wu, Rui Yan, Borui Li, An-Chieh Cheng, Xueyan Zou, Yunhao Fang, Xuxin Cheng, Ri-Zhao Qiu, Hongxu Yin, Sifei Liu, Song Han, Yao Lu, Xiaolong Wang

TL;DR

EgoVLA is a vision-language-action framework trained on large-scale egocentric human videos to learn dexterous manipulation and transfer the resulting policies to a bimanual humanoid robot. By unifying the human and robot hand representations around MANO and applying IK-based retargeting, EgoVLA achieves transferable control with a modest number of robot demonstrations. The work introduces the Ego Humanoid Manipulation Benchmark in NVIDIA IsaacSim, enabling reproducible evaluation across short- and long-horizon tasks and varied visual conditions. Empirical results show that human-video pretraining substantially improves in-domain and out-of-domain performance, while ablations show that robot demonstrations remain necessary for reliable deployment.

Abstract

Real robot data collection for imitation learning has led to significant advancements in robotic manipulation. However, the requirement for robot hardware in the process fundamentally constrains the scale of the data. In this paper, we explore training Vision-Language-Action (VLA) models using egocentric human videos. The benefit of using human videos is not only for their scale but more importantly for the richness of scenes and tasks. With a VLA trained on human video that predicts human wrist and hand actions, we can perform Inverse Kinematics and retargeting to convert the human actions to robot actions. We fine-tune the model using a few robot manipulation demonstrations to obtain the robot policy, namely EgoVLA. We propose a simulation benchmark called Ego Humanoid Manipulation Benchmark, where we design diverse bimanual manipulation tasks with demonstrations. We fine-tune and evaluate EgoVLA with Ego Humanoid Manipulation Benchmark and show significant improvements over baselines and ablate the importance of human data. Videos can be found on our website: https://rchalyang.github.io/EgoVLA

Paper Structure

This paper contains 29 sections, 2 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: EgoVLA. Our vision-language-action model learns manipulation skills from egocentric human videos and transfers them to a bimanual humanoid robot. The top row illustrates the diverse manipulation behaviors demonstrated by humans in the video dataset, while the bottom row shows the robot performing egocentric dexterous manipulation based on the learned skills.
  • Figure 2: EgoVLA takes visual history, a language instruction, and an action query token as input. The latent features are converted to human actions by the action head. We use the wrist pose and MANO hand parameters (MANO, SIGGRAPH Asia 2017) as the human action space.
  • Figure 3: Human Data
  • Figure 4: Unified Action Space: MANO hand parameters are used as a shared action space for humans and robots. For robot hands, during training, optimized MANO parameters produce the same fingertip positions as the robot hand fingertips. A small MLP maps predicted fingertip positions to joint commands during deployment.
  • Figure 5: Task Visualization. All simulated tasks with predicted wrist trajectories from EgoVLA.
  • ...and 11 more figures
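
The deployment step described in the Figure 4 caption, where a small MLP maps predicted fingertip positions to joint commands, can be illustrated with a minimal sketch. This is not the authors' implementation: the layer sizes, the joint count `NUM_HAND_JOINTS`, the helper names `init_mlp` and `fingertips_to_joints`, and the use of untrained random weights are all assumptions made purely for illustration.

```python
import numpy as np

NUM_FINGERTIPS = 5    # assumption: one 3D position per fingertip
NUM_HAND_JOINTS = 12  # assumption: placeholder joint count, not taken from the paper


def init_mlp(rng, sizes):
    """Create randomly initialized (weight, bias) pairs, one per layer (untrained)."""
    return [
        (rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def fingertips_to_joints(params, fingertips):
    """Map (batch, NUM_FINGERTIPS, 3) fingertip positions to (batch, NUM_HAND_JOINTS)."""
    x = fingertips.reshape(fingertips.shape[0], -1)  # flatten to (batch, 15)
    for w, b in params[:-1]:
        x = np.maximum(x @ w + b, 0.0)  # hidden layers with ReLU
    w, b = params[-1]
    return x @ w + b  # linear output layer: one value per joint


rng = np.random.default_rng(0)
params = init_mlp(rng, [NUM_FINGERTIPS * 3, 64, 64, NUM_HAND_JOINTS])

# Hypothetical batch of 4 predicted fingertip configurations (meters, wrist frame).
fingertips = rng.uniform(-0.1, 0.1, size=(4, NUM_FINGERTIPS, 3))
joint_cmds = fingertips_to_joints(params, fingertips)
print(joint_cmds.shape)
```

In a real pipeline such a network would presumably be fit on pairs of robot fingertip positions and joint angles; here the weights are random, so only the input and output shapes are meaningful.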