Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers

Vidhi Jain, Maria Attarian, Nikhil J Joshi, Ayzaan Wahid, Danny Driess, Quan Vuong, Pannag R Sanketi, Pierre Sermanet, Stefan Welker, Christine Chan, Igor Gilitschenski, Yonatan Bisk, Debidatta Dwibedi

TL;DR

Vid2Robot learns manipulation from video demonstrations by conditioning a robot policy on a prompt video and the robot's current state, using cross-attention transformers to fuse the two. It introduces a four-module architecture (prompt video encoder, state encoder, state-prompt cross-attention, and action decoder) trained with a behavior cloning loss plus three auxiliary losses that align representations across embodiments. In real-robot experiments, Vid2Robot improves on the BC-Z baseline by over 20% when prompted with human videos, and it demonstrates cross-object motion transfer: a motion shown on one object in the prompt video carries over to a different object in the robot's environment. The work provides a dataset of paired prompt videos and robot trajectories, an end-to-end learnable policy for rapid skill adaptation, and insights into the benefits and limits of video-based task specification for robotics.

Abstract

Large-scale multi-task robotic manipulation systems often rely on text to specify the task. In this work, we explore whether a robot can learn by observing humans. To do so, the robot must understand a person's intent and perform the inferred task despite differences in the embodiments and environments. We introduce Vid2Robot, an end-to-end video-conditioned policy that takes human videos demonstrating manipulation tasks as input and produces robot actions. Our model is trained with a large dataset of prompt video-robot trajectory pairs to learn unified representations of human and robot actions from videos. Vid2Robot uses cross-attention transformer layers between video features and the current robot state to produce the actions and perform the same task as shown in the video. We use auxiliary contrastive losses to align the prompt and robot video representations for better policies. We evaluate Vid2Robot on real-world robots and observe over 20% improvement over BC-Z when using human prompt videos. Further, we also show cross-object motion transfer ability that enables video-conditioned policies to transfer a motion observed on one object in the prompt video to another object in the robot's own environment. Videos available at https://vid2robot.github.io
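
As a rough illustration of the cross-attention step described in the abstract, the sketch below lets robot-state tokens act as queries over prompt-video tokens. It is a minimal single-head version in plain Python. The function names, the toy token values, and the identity projections (no learned weights) are assumptions made for illustration, not details taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(state_tokens, prompt_tokens):
    """Single-head cross-attention: state tokens query prompt-video tokens.

    Assumption: queries, keys, and values use identity projections, so the
    example stays self-contained. A real model would learn these projections.
    """
    dim = len(state_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in state_tokens:
        # Scaled dot-product score of this state token against every prompt token.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in prompt_tokens]
        weights = softmax(scores)
        # Output is the attention-weighted average of the prompt tokens.
        fused.append([sum(w * value[d] for w, value in zip(weights, prompt_tokens))
                      for d in range(dim)])
    return fused


# Toy inputs (hypothetical): 2 state tokens, 3 prompt tokens, dimension 2.
state_tokens = [[1.0, 0.0], [0.0, 1.0]]
prompt_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(state_tokens, prompt_tokens)
```

Each output row has the same dimension as a state token but is built entirely from prompt-token content, which is how the prompt video can inform the action predicted at the current timestep.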

Paper Structure (28 sections, 4 equations, 11 figures, 6 tables)

Figures (11)

  • Figure 1: Overview. Vid2Robot is a video-conditioned robot policy. Given a human demonstration (top), Vid2Robot recognizes the task semantics and performs the same task based on the robot's current visual observation (bottom left). A successful trajectory is presented on the bottom right.
  • Figure 2: Dataset creation. (top row) A Robot-Robot video pair for placing the rxbar into the top drawer. We similarly pair existing robot-robot videos performing the same task. (middle row) Hindsight Human-Robot paired videos for the task of picking a Coke can from the bottom drawer and placing it on the counter. We use the task instructions from robot trajectories, ask human participants to perform the task, and record a demonstration video from the robot's perspective. (bottom row) A Co-located Human-Robot pair of videos for placing the pipe wrench in the toolkit. We record a human demonstration and a robot teleoperation in the same workspace. We use different workspaces to perform the same task instruction, thus collecting paired videos with visually diverse prompts and robot state observations. More details are in the datasets subsection.
  • Figure 3: Architecture. Our model takes the prompt video and the robot's current observations as input, encodes them into token embeddings for the prompt video and the robot's state, cross-attends to produce a state-prompt encoding, and translates it into the expected robot action at the current timestep. More details are in the architecture subsection.
  • Figure 4: Training Setup. We show all the losses used for training Vid2Robot, particularly how each loss connects to its different modules. Along with (1) the main action prediction loss, we apply three auxiliary losses: (2) a temporal video alignment loss, (3) a contrastive loss between the prompt and robot videos performing the same task, and (4) a contrastive loss between a prompt/robot video and the language embedding. More details are in the training subsection.
  • Figure 5: Policy Rollouts. Each row shows a prompt video of a human doing a task on the left and the corresponding successful robot rollouts using Vid2Robot on the right. Note how visually different the prompts are, while the policy rollouts vary in lighting and background as well as in the number and placement of distractor objects.
  • ...and 6 more figures
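
The contrastive loss between prompt and robot videos named in the Figure 4 caption can be pictured with an InfoNCE-style objective. The sketch below is one common way to write such a loss, not necessarily the form used in the paper. The function names, the temperature value, the toy embeddings, and the one-directional (prompt to robot) matching are all assumptions.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(prompt_embs, robot_embs, temperature=0.1):
    """InfoNCE-style loss: prompt i should match robot video i in the batch.

    Assumption: cosine similarity with a fixed temperature. Other videos in
    the batch serve as negatives.
    """
    prompts = [normalize(v) for v in prompt_embs]
    robots = [normalize(v) for v in robot_embs]
    losses = []
    for i, prompt in enumerate(prompts):
        logits = [sum(a * b for a, b in zip(prompt, robot)) / temperature
                  for robot in robots]
        # Stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching robot video as the target class.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): two tasks, one prompt and one robot video each.
prompt_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(prompt_embs, [[0.9, 0.1], [0.1, 0.9]])
shuffled = info_nce(prompt_embs, [[0.1, 0.9], [0.9, 0.1]])
```

In this toy setup the loss is lower when each prompt embedding sits close to the robot video of the same task (`aligned`) than when the pairs are swapped (`shuffled`), which is the behavior an alignment loss of this kind is meant to encourage.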