6-DoF Grasp Planning using Fast 3D Reconstruction and Grasp Quality CNN

Yahav Avigal, Samuel Paradis, Harry Zhang

TL;DR

This work tackles affordable 6-DoF grasp planning for home robotics by leveraging inexpensive RGB cameras and learned multi-view depth reconstruction. It retrains the Learnt Stereo Machine (LSM) on graspable objects to produce depth maps from multiple views and combines these with a Multi-View GQ-CNN (MV-GQ-CNN) to plan robust 6-DoF grasps across viewpoints. Key contributions include a synthetic data generation pipeline for LSM retraining, an MV-GQ-CNN architecture adapted to varying camera viewpoints, and an evaluation showing feasible depth-based grasp planning with LSM-produced maps. The approach promises practical, low-cost 6-DoF grasp planning suitable for cluttered home environments and paves the way for real-robot validation and clutter-aware extensions.

Abstract

Recent consumer demand for home robots has accelerated progress in robotic grasping. However, a key component of the perception pipeline, the depth camera, is still expensive and inaccessible to most consumers. In addition, grasp planning has improved significantly in recent years by leveraging large datasets and cloud robotics, and by limiting the state and action space to top-down grasps with 4 degrees of freedom (DoF). By exploiting the multi-view geometry of an object using inexpensive equipment such as off-the-shelf RGB cameras and state-of-the-art algorithms such as the Learnt Stereo Machine (LSM\cite{kar2017learning}), a robot can generate more robust 6-DoF grasps from different angles. In this paper, we present a modification of LSM for graspable objects, evaluate the resulting grasps, and develop a 6-DoF grasp planner based on Grasp-Quality CNN (GQ-CNN\cite{mahler2017dex}) that exploits multiple camera views to plan a robust grasp, even when no top-down grasp is possible.
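
The multi-view selection step can be illustrated with a minimal Python sketch. The paper states that a grasp is planned from each view and the grasp with the highest success probability is chosen. Everything else here is an assumption made for illustration: the `Grasp` fields, the `plan_best_grasp` helper, and in particular `toy_planner`, which is a hypothetical stand-in for a learned per-view planner and is not GQ-CNN.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

DepthMap = List[List[float]]  # row-major depth image, in meters


@dataclass(frozen=True)
class Grasp:
    view_index: int  # index of the camera view the grasp was planned in
    row: int         # grasp center pixel row in that view
    col: int         # grasp center pixel column in that view
    angle: float     # in-plane gripper rotation, in radians
    quality: float   # predicted success probability in [0, 1]


def plan_best_grasp(
    depth_maps: Sequence[DepthMap],
    plan_in_view: Callable[[int, DepthMap], Grasp],
) -> Grasp:
    """Plan one grasp per view and return the one with the highest quality."""
    if not depth_maps:
        raise ValueError("at least one depth map is required")
    candidates = [plan_in_view(i, d) for i, d in enumerate(depth_maps)]
    return max(candidates, key=lambda g: g.quality)


def toy_planner(view_index: int, depth: DepthMap) -> Grasp:
    """Hypothetical stand-in for a learned planner (NOT GQ-CNN).

    Grasps the pixel closest to the camera and scores the view by its
    relative depth relief, only so that the selection logic can be run.
    """
    flat = [(z, r, c) for r, row in enumerate(depth) for c, z in enumerate(row)]
    z_min, r, c = min(flat)
    z_max = max(z for z, _, _ in flat)
    quality = (z_max - z_min) / z_max if z_max > 0 else 0.0
    return Grasp(view_index, r, c, 0.0, quality)


if __name__ == "__main__":
    views = [
        [[0.70, 0.70], [0.70, 0.65]],  # view 0: little depth relief
        [[0.80, 0.40], [0.80, 0.80]],  # view 1: strong depth relief
    ]
    print(plan_best_grasp(views, toy_planner))
```

Passing the per-view planner as a callable keeps the selection logic independent of the model, so a real quality network could replace `toy_planner` without changing `plan_best_grasp`.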

Paper Structure

This paper contains 16 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Scene to Grasp Pipeline: Starting with some graspable object, we first render multiple views of the scene from several cameras. Next, we use these renderings as input to a specially trained LSM, which outputs depth maps. These depth maps are fed into a multi-view GQ-CNN, which generates the optimal grasp across all views.
  • Figure 2: Pretrained Reconstruction Error: Pixel-wise difference between the LSM-predicted depth map and the ground truth depth map shows poor-quality prediction of a graspable bottle's depth image.
  • Figure 3: Data Generation Illustrations: Left: the original bottle object 3D mesh. Middle: renderings of 3 RGB images from 3 different angles. Right: renderings of the 3 corresponding depth maps.
  • Figure 4: Multi-View Grasping: Cameras are located in different locations in a hemisphere around the target object. A grasp is planned from each view, and the grasp with the highest success probability is chosen and applied.
  • Figure 5: Unseen Objects: Difference between ground truth and predicted depth map. Top half contains unseen bottles, bottom half contains objects from unseen categories. Our network has better performance on both types of unseen objects, as these graspable objects better match our training distribution. From top to bottom: shampoo bottle, waffleroll, sitting cat, NutellaGo package.
  • ...and 2 more figures
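
The pixel-wise difference between a predicted and a ground-truth depth map, as visualized in Figures 2 and 5, can be sketched as follows. This is an illustrative assumption rather than the paper's evaluation code: the use of the absolute difference, the convention that a ground-truth depth of zero or less marks an invalid pixel, and the function names `depth_error_map` and `mean_abs_error` are all hypothetical.

```python
from typing import List, Optional

DepthMap = List[List[float]]  # row-major depth image, in meters


def depth_error_map(pred: DepthMap, truth: DepthMap) -> List[List[Optional[float]]]:
    """Absolute per-pixel depth difference.

    Pixels whose ground-truth depth is <= 0 are treated as invalid
    (an assumption of this sketch) and reported as None.
    """
    if len(pred) != len(truth) or any(len(p) != len(t) for p, t in zip(pred, truth)):
        raise ValueError("depth maps must have the same shape")
    return [
        [abs(p - t) if t > 0 else None for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(pred, truth)
    ]


def mean_abs_error(pred: DepthMap, truth: DepthMap) -> float:
    """Mean absolute depth error over valid ground-truth pixels."""
    errors = [e for row in depth_error_map(pred, truth) for e in row if e is not None]
    if not errors:
        raise ValueError("no valid ground-truth pixels")
    return sum(errors) / len(errors)


if __name__ == "__main__":
    predicted = [[1.0, 2.0], [0.5, 3.0]]
    ground_truth = [[1.5, 2.0], [0.0, 2.0]]  # 0.0 marks a pixel with no depth
    print(depth_error_map(predicted, ground_truth))
    print(mean_abs_error(predicted, ground_truth))
```

Masking invalid pixels keeps background regions with no rendered depth from dominating the summary error.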