Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers
Vidhi Jain, Maria Attarian, Nikhil J Joshi, Ayzaan Wahid, Danny Driess, Quan Vuong, Pannag R Sanketi, Pierre Sermanet, Stefan Welker, Christine Chan, Igor Gilitschenski, Yonatan Bisk, Debidatta Dwibedi
TL;DR
Vid2Robot learns manipulation from video demonstrations by conditioning a robot policy on a prompt video and the robot's current state, using cross-attention transformers to fuse the two. It introduces a four-module architecture (prompt video encoder, state encoder, state-prompt cross-attention, and action decoder) trained with behavior cloning plus three auxiliary losses that align representations across embodiments. In real-robot experiments, Vid2Robot improves on BC-Z, a video-conditioned baseline, by over 20% when prompted with human videos. It also demonstrates cross-object motion transfer: a motion shown on one object in the prompt video carries over to a different object in the robot's environment. The work contributes a large dataset of paired prompt videos and robot trajectories, an end-to-end learned policy for rapid skill adaptation, and insights into the benefits and limits of video-based task specification for robotics.
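The state-prompt cross-attention step named above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention in which robot-state tokens act as queries and prompt-video tokens act as keys and values, and every name, shape, and weight matrix here (`cross_attention`, `state_tokens`, `prompt_tokens`, `w_q`, `w_k`, `w_v`) is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(state_tokens, prompt_tokens, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative assumption, not the paper's code).

    state_tokens:  (n_state, d_model)  tokens from the robot's current state
    prompt_tokens: (n_prompt, d_model) tokens from the prompt video
    """
    q = state_tokens @ w_q    # queries come from the robot state
    k = prompt_tokens @ w_k   # keys come from the prompt video
    v = prompt_tokens @ w_v   # values come from the prompt video
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each state token attends over prompt tokens
    return weights @ v, weights

# Toy inputs: 4 state tokens and 10 prompt tokens, all hypothetical.
rng = np.random.default_rng(0)
d_model = 16
state_tokens = rng.normal(size=(4, d_model))
prompt_tokens = rng.normal(size=(10, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                 for _ in range(3))

fused, weights = cross_attention(state_tokens, prompt_tokens, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Each row of `weights` is a distribution over prompt tokens, so every state token receives a prompt-conditioned summary that a downstream action decoder could consume.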
Abstract
Large-scale multi-task robotic manipulation systems often rely on text to specify the task. In this work, we explore whether a robot can learn by observing humans. To do so, the robot must understand a person's intent and perform the inferred task despite differences in embodiment and environment. We introduce Vid2Robot, an end-to-end video-conditioned policy that takes human videos demonstrating manipulation tasks as input and produces robot actions. Our model is trained on a large dataset of prompt video-robot trajectory pairs to learn unified representations of human and robot actions from videos. Vid2Robot applies cross-attention transformer layers between video features and the current robot state to produce actions that perform the same task as shown in the video. We use auxiliary contrastive losses to align the prompt and robot video representations, which yields better policies. We evaluate Vid2Robot on real-world robots and observe an improvement of over 20% compared to BC-Z when using human prompt videos. We also show a cross-object motion transfer ability: video-conditioned policies can transfer a motion observed on one object in the prompt video to another object in the robot's own environment. Videos are available at https://vid2robot.github.io
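To make the idea of aligning prompt and robot video representations concrete, here is a minimal sketch of one common contrastive objective, a symmetric InfoNCE loss. The abstract does not specify the exact form of the auxiliary losses, so this formulation, the function name `symmetric_info_nce`, the temperature value, and the toy embeddings are all assumptions for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_info_nce(prompt_emb, robot_emb, temperature=0.1):
    """Symmetric InfoNCE over a batch (assumed form, not the paper's exact loss).

    Row i of prompt_emb and row i of robot_emb are treated as a positive pair
    (same task); every other pairing in the batch is a negative.
    """
    p = l2_normalize(prompt_emb)
    r = l2_normalize(robot_emb)
    logits = p @ r.T / temperature          # (batch, batch) cosine similarities
    idx = np.arange(len(p))
    loss_p2r = -log_softmax(logits, axis=1)[idx, idx].mean()  # prompt -> robot
    loss_r2p = -log_softmax(logits, axis=0)[idx, idx].mean()  # robot -> prompt
    return 0.5 * (loss_p2r + loss_r2p)

# Toy embeddings (hypothetical): aligned pairs vs. deliberately mismatched pairs.
rng = np.random.default_rng(0)
prompt_emb = rng.normal(size=(8, 32))
aligned = prompt_emb + 0.01 * rng.normal(size=(8, 32))
mismatched = aligned[::-1]  # reversing 8 rows leaves no row paired with its match

loss_aligned = symmetric_info_nce(prompt_emb, aligned)
loss_mismatched = symmetric_info_nce(prompt_emb, mismatched)
print(loss_aligned, loss_mismatched)
```

The loss is low when matching prompt and robot embeddings sit close together and high when they do not, which is the pressure that pulls the two embodiments toward a shared representation.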
