Self-supervised 6-DoF Robot Grasping by Demonstration via Augmented Reality Teleoperation System

Xiwen Dengxiong, Xueting Wang, Shi Bai, Yunbo Zhang

TL;DR

This work tackles grasp pose detection for unknown objects in $6$-DoF grasping, targeting restricted environments where annotating grasp poses is impractical. It introduces a self-supervised framework that uses an AR teleoperation system to collect human demonstrations and learns a contrastive point-cloud representation, producing $6$-DoF grasp poses without explicit grasp labels. A key contribution is the demonstration learning module, which maps morphology-based demonstrations to pose adjustments and yields accurate grasps after only a few demonstrations. Real-world experiments show that the approach improves grasp success on unseen objects and reduces the annotation burden, with sub-second planning and teleoperation latency, making it practical for remote or hazardous settings.

Abstract

Most existing 6-DoF robot grasping solutions depend on strong supervision of grasp poses to ensure satisfactory performance, which can be laborious and impractical when the robot works in a restricted area. To this end, we propose a self-supervised 6-DoF grasp pose detection framework built on an Augmented Reality (AR) teleoperation system that efficiently learns from human demonstrations and provides 6-DoF grasp poses without grasp pose annotations. Specifically, the system collects human demonstrations from the AR environment and contrastively learns the grasping strategy from them. In real-world experiments, the proposed system achieves satisfactory grasping performance and learns to grasp unknown objects within three demonstrations.

Paper Structure (22 sections, 1 equation, 4 figures, 3 tables)

Figures (4)

  • Figure 1: 6-DoF grasp poses for unknown objects. The red gripper shows the grasp pose. The first row illustrates the grasp poses generated without human demonstration. The second row shows the final grasp poses after learning from demonstrations.
  • Figure 2: System overview. Section A illustrates the system design. The user camera is used to render the remote environment in the AR display. Users can control the remote robot through the AR software. The robot camera collects RGB-D images for robot grasping. Section B shows the architecture of the remote robot server, including robot control, grasp pose control, and demonstration learning.
  • Figure 3: The contrastive point cloud learning model first augments the input and extracts features through projector $g(f(\cdot))$.
  • Figure 4: Example objects in hardware, household, food, and toys.
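
The contrastive point-cloud learning step named in Figure 3 (augment the input, then extract features through $g(f(\cdot))$) can be illustrated with a minimal sketch. This summary does not state the paper's augmentations, network architecture, or loss, so everything below is an assumption: `augment`, `encode` (standing in for $f$), and `project` (standing in for $g$) are hypothetical toy functions, and the loss is the NT-Xent objective from SimCLR-style contrastive learning, swapped in as a plausible stand-in rather than the authors' method.

```python
import math
import random


def augment(points, rng, jitter=0.01):
    """Assumed augmentation: random rotation about the z-axis plus Gaussian jitter."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    c, s = math.cos(theta), math.sin(theta)
    return [
        (
            c * x - s * y + rng.gauss(0.0, jitter),
            s * x + c * y + rng.gauss(0.0, jitter),
            z + rng.gauss(0.0, jitter),
        )
        for x, y, z in points
    ]


def encode(points):
    """Hypothetical stand-in for the encoder f: simple global shape statistics."""
    n = len(points)
    radial = [math.hypot(x, y) for x, y, _ in points]
    heights = [z for _, _, z in points]
    return [sum(radial) / n, max(radial), sum(heights) / n, max(heights) - min(heights)]


def project(feature):
    """Hypothetical stand-in for the projector g: L2-normalise the feature."""
    norm = math.sqrt(sum(v * v for v in feature)) or 1.0
    return [v / norm for v in feature]


def nt_xent(z_a, z_b, temperature=0.5):
    """NT-Xent loss for a batch where z_a[i] and z_b[i] are two views of sample i."""
    z = z_a + z_b
    n = len(z_a)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the positive view for anchor i
        sims = [sum(p * q for p, q in zip(z[i], z[k])) / temperature for k in range(2 * n)]
        others = [k for k in range(2 * n) if k != i]
        m = max(sims[k] for k in others)  # subtract the max for numerical stability
        log_denom = math.log(sum(math.exp(sims[k] - m) for k in others))
        total += log_denom - (sims[j] - m)
    return total / (2 * n)


if __name__ == "__main__":
    rng = random.Random(0)
    # Two synthetic "objects": a flat disc-like cloud and a tall thin cloud.
    flat = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(0, 0.1)) for _ in range(128)]
    tall = [(rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1), rng.uniform(0, 1)) for _ in range(128)]
    views_a = [project(encode(augment(c, rng))) for c in (flat, tall)]
    views_b = [project(encode(augment(c, rng))) for c in (flat, tall)]
    print("NT-Xent loss:", nt_xent(views_a, views_b))
```

In a real system the toy `encode` and `project` would be learned networks trained by minimising this loss, so that two augmented views of the same object end up closer in feature space than views of different objects.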