Let Me Show You: Learning by Retrieving from Egocentric Video for Robotic Manipulation

Yichen Zhu, Feifei Feng

TL;DR

Robotic manipulation often suffers from data inefficiency; this work tackles it by retrieving from a bank of human demonstration videos to provide mid-level cues for learning policies. The Retrieving-from-Video (RfV) framework includes a video retriever that selects task-relevant clips from $D_{video}$ and a policy generator that ingests retrieved mid-level information—affordance masks $\alpha$ and hand trajectories $\tau$—into policy learning via cross-attention. Mid-level information is extracted offline from egocentric videos using GroundingDINO, GPT-4V, and SAM to produce $\alpha$ and $\tau$, with trajectory smoothing to ensure realism. Empirical results in Metaworld simulation and eight real-Franka tasks show that RfV outperforms several baselines, with ablations underscoring the importance of retrieval and mid-level cues and demonstrating robust generalization across spatial, distractor, and appearance variations, indicating practical viability of a retrieval-augmented robotics approach.
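
To make the cross-attention step concrete, the following is a minimal sketch of single-head scaled dot-product cross-attention in plain Python. It is not the authors' implementation: the function names, the absence of learned projection matrices, and the toy token values are all assumptions made for illustration. The queries stand in for robot observation tokens, and the keys and values stand in for tokens derived from the retrieved affordance masks $\alpha$ and hand trajectories $\tau$.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention with no learned projections.

    queries: list of d-dim vectors (assumed: robot observation tokens)
    keys, values: lists of vectors (assumed: tokens from retrieved cues)
    Returns one attended vector per query.
    """
    d = len(keys[0])
    attended = []
    for q in queries:
        # Similarity of this query to every retrieved token.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        attended.append(
            [
                sum(w * v[c] for w, v in zip(weights, values))
                for c in range(len(values[0]))
            ]
        )
    return attended


# Toy example: one observation token attends over two retrieved tokens.
obs_tokens = [[1.0, 0.0]]
cue_keys = [[10.0, 0.0], [0.0, 10.0]]
cue_values = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_attention(obs_tokens, cue_keys, cue_values)
```

In this toy setup the observation token aligns with the first retrieved token, so the fused vector is dominated by the first value vector. A real policy would add learned query, key, and value projections and multiple heads.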

Abstract

Robots operating in complex and uncertain environments face considerable challenges. Advanced robotic systems often rely on extensive datasets to learn manipulation tasks. In contrast, when humans are faced with unfamiliar tasks, such as assembling a chair, a common approach is to learn by watching video demonstrations. In this paper, we propose a novel method for learning robot policies by Retrieving-from-Video (RfV), using analogies from human demonstrations to address manipulation tasks. Our system constructs a video bank comprising recordings of humans performing diverse daily tasks. To enrich the knowledge from these videos, we extract mid-level information, such as object affordance masks and hand motion trajectories, which serve as additional inputs to enhance the robot model's learning and generalization capabilities. We further feature a dual-component system: a video retriever that taps into an external video bank to fetch task-relevant video based on task specification, and a policy generator that integrates this retrieved knowledge into the learning cycle. This approach enables robots to craft adaptive responses to various scenarios and generalize to tasks beyond those in the training data. Through rigorous testing in multiple simulated and real-world settings, our system demonstrates a marked improvement in performance over conventional robotic systems, showcasing a significant breakthrough in the field of robotics.
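
As an illustration of the retrieval step, the sketch below ranks clips in a small video bank by cosine similarity to a task embedding and returns the top matches. It is only a sketch under stated assumptions: the abstract says that retrieval is driven by the task specification, but it does not say how similarity is computed. The embeddings, clip identifiers, and the `retrieve_top_k` helper are hypothetical placeholders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(task_embedding, video_bank, k=1):
    """Return the k bank entries most similar to the task embedding."""
    ranked = sorted(
        video_bank,
        key=lambda entry: cosine(task_embedding, entry["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical bank: each clip carries a precomputed embedding.
# In the framework described above, each clip would also carry its
# offline-extracted affordance mask and hand trajectory.
video_bank = [
    {"clip_id": "open_drawer", "embedding": [0.9, 0.1, 0.0]},
    {"clip_id": "pick_cube", "embedding": [0.1, 0.9, 0.1]},
    {"clip_id": "pour_water", "embedding": [0.0, 0.2, 0.9]},
]

# Hypothetical embedding of a language instruction such as "pick up the cube".
task_embedding = [0.2, 1.0, 0.0]
top_clips = retrieve_top_k(task_embedding, video_bank, k=2)
top_ids = [clip["clip_id"] for clip in top_clips]
```

Here `top_ids` lists the two clips whose embeddings point in the direction closest to the task embedding. In practice the embeddings would come from a text or video encoder, and the bank would be indexed for fast nearest-neighbor search.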

Paper Structure

This paper contains 10 sections, 1 equation, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Left: The levels of information gained from robot data and from video. Right: An overview of our retrieving-from-video framework.
  • Figure 2: The framework of our RfV consists of three main components: the video bank (top left), the video retriever (top right), and the policy generator (bottom). The video retriever retrieves relevant videos based on language instructions, while the policy generator processes the retrieved videos and their mid-level information to facilitate the training and evaluation of the robot model.
  • Figure 3: The setup of our real Franka robot and example tasks from our real-world experiments.
  • Figure 4: Left: The setup for the spatial generalization experiments. We randomly place the tennis ball (highlighted by a red bounding box) and the tennis ball box (highlighted by an orange bounding box). Right: The appearance generalization experiments. We change the cube to a color that does not appear in the training data.