Slot-Level Robotic Placement via Visual Imitation from Single Human Video

Dandan Shan, Kaichun Mo, Wei Yang, Yu-Wei Chao, David Fouhey, Dieter Fox, Arsalan Mousavian

TL;DR

This work tackles slot-level robotic placement by learning from a single human demonstration video. It introduces SLeRP, a modular system that uses Slot-Net to detect the placement slot and correlates human and robot views to compute a transformation $T_i \in SE(3)$ for each slot, enabling precise 3D placement. A novel data-augmentation pipeline and a real-world 288-video benchmark demonstrate that SLeRP outperforms baselines and works on real robots. The approach significantly reduces training data requirements for new slot-level tasks and shows strong generalization to unseen objects and scenes in real-world robotics applications.

Abstract

The majority of modern robot learning methods focus on learning a set of pre-defined tasks with limited or no generalization to new tasks. Extending the robot skillset to novel tasks involves gathering an extensive amount of training data for additional tasks. In this paper, we address the problem of teaching new tasks to robots using human demonstration videos for repetitive tasks (e.g., packing). This task requires understanding the human video to identify which object is being manipulated (the pick object) and where it is being placed (the placement slot). In addition, it needs to re-identify the pick object and the placement slots during inference along with the relative poses to enable robot execution of the task. To tackle this, we propose SLeRP, a modular system that leverages several advanced visual foundation models and a novel slot-level placement detector Slot-Net, eliminating the need for expensive video demonstrations for training. We evaluate our system using a new benchmark of real-world videos. The evaluation results show that SLeRP outperforms several baselines and can be deployed on a real robot.

Paper Structure

This paper contains 14 sections, 1 equation, 8 figures, 3 tables.

Figures (8)

  • Figure 1: We introduce the novel problem of imitating slot-level robotic placement from a single human video. Given a human demonstration video showing an object being placed in a slot, and a new robot-view image captured by the robot wrist camera (which may feature varied camera and object poses and changed scenes), SLeRP is able to find the corresponding object and similar slots in the robot view, and provide the 6-DoF transformation matrix for each detected slot to guide the robot in placing the object accurately.
  • Figure 2: Method Overview. The system begins by analyzing the input human video, tracking the object (highlighted in yellow) throughout the sequence and identifying the placement slot (highlighted in red). Next, we re-identify the object and the slot in the robot's view by correlating the human-view and robot-view images. Using depth images, we reconstruct the observations in 3D and compute a single 6-DoF object transformation $T$ in the robot's view, enabling the robot to transfer the object into the slot. If more than one slot is present, we detect all applicable slots and compute one 6-DoF object transformation for each slot. Finally, such 6-DoF object transformations are sent to the downstream robot planning and control pipeline for real robot pick-and-place execution.
  • Figure 3: Parse Human Video. Given the input human video (bottom), we run a state-of-the-art hand-object detector (yellow) and tracker (blue) to obtain the pick object mask (yellow), and train a novel network, Slot-Net (red), to identify the slot mask (red).
  • Figure 4: Slot-Net Data Generation. Given an object-centric image (top middle), we inpaint to remove an object and reveal its slot (top left) and manually annotate the slot mask (top right). We then outpaint these images with a scene background (bottom) to create a start and end image pair with a ground-truth slot mask.
  • Figure 5: Correlate with Robot View. Given the object and slot masks detected in the human video, we first re-identify the corresponding object and slot in the robot view and find all similar empty slots. With the corresponding object and slot masks, we then compute 2D keypoint matches across the local patches of the detected object and slot masks and lift the observations to 3D to compute 6-DoF transforms.
  • ...and 3 more figures
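
The captions for Figures 2 and 5 describe lifting matched 2D keypoints to 3D with depth images and then computing a 6-DoF transform. The listing here does not say how SLeRP performs either step, so the following is only a minimal sketch of one common way to do it, under stated assumptions: a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), metric depth at each keypoint, and a standard least-squares rigid fit (the Kabsch algorithm) swapped in as the pose solver. The function names `backproject` and `fit_rigid_transform` are hypothetical and do not come from the paper.

```python
import numpy as np


def backproject(pixels, depths, fx, fy, cx, cy):
    """Lift (u, v) pixels with metric depth to 3D camera-frame points.

    Assumes an ideal pinhole camera with no lens distortion.
    """
    pixels = np.asarray(pixels, dtype=float)
    z = np.asarray(depths, dtype=float)
    x = (pixels[:, 0] - cx) * z / fx
    y = (pixels[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)


def fit_rigid_transform(src, dst):
    """Least-squares rigid transform (Kabsch) mapping src points onto dst.

    src, dst: (N, 3) arrays of corresponding 3D points, N >= 3.
    Returns a 4x4 homogeneous matrix T with dst ~= R @ src + t.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)

    # Center both point sets so that only the rotation remains to be solved.
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)  # 3x3 cross-covariance

    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c

    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T
```

In this sketch, `src` would hold the back-projected keypoints on the object at its current pose and `dst` the corresponding target points at the slot, so that the returned `T` is an element of $SE(3)$ in the camera frame. Real keypoint matches contain outliers, so a practical pipeline would likely wrap the fit in a robust estimator such as RANSAC; that part is omitted here.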