Learning Any-View 6DoF Robotic Grasping in Cluttered Scenes via Neural Surface Rendering

Snehal Jauhri; Ishikaa Lunawat; Georgia Chalvatzaki

TL;DR

This work reinterprets grasping as rendering and introduces NeuGraspNet, a novel method for 6DoF grasp detection that leverages advances in neural volumetric representations and surface rendering.

Abstract

A significant challenge for real-world robotic manipulation is the effective 6DoF grasping of objects in cluttered scenes from any single viewpoint without the need for additional scene exploration. This work reinterprets grasping as rendering and introduces NeuGraspNet, a novel method for 6DoF grasp detection that leverages advances in neural volumetric representations and surface rendering. It encodes the interaction between a robot's end-effector and an object's surface by jointly learning to render the local object surface and learning grasping functions in a shared feature space. The approach uses global (scene-level) features for grasp generation and local (grasp-level) neural surface features for grasp evaluation. This enables effective, fully implicit 6DoF grasp quality prediction, even in partially observed scenes. NeuGraspNet operates on random viewpoints, common in mobile manipulation scenarios, and outperforms existing implicit and semi-implicit grasping methods. The real-world applicability of the method has been demonstrated with a mobile manipulator robot, grasping in open, cluttered spaces. Project website at https://sites.google.com/view/neugraspnet

Paper Structure (29 sections, 5 equations, 9 figures, 9 tables)

Introduction
Background & Related Work
Neural implicit scene representations
Grasp detection
Learning 6DoF Grasping via Neural Surface Rendering
Neural scene reconstruction
Scene-level rendering & grasp candidate generation
Local surface rendering & learning grasping functions
Implementation
Experiments
Grasp Quality Prediction
Comparison with baselines on VGN (Breyer et al., 2020) scenes
Ablation study
Comparison with baselines on EGAD (Morrison et al., 2020)
Grasp Affordance Prediction
...and 14 more sections

Figures (9)

Figure 1: NeuGraspNet: A single-view 3D Truncated Signed Distance Field (TSDF) grid is processed through a convolutional occupancy network to reconstruct the scene (cf. "Neural scene reconstruction"). The occupancy network is used to perform global, scene-level rendering. The rendered scene is used for grasp candidate generation in SE(3) (cf. "Scene-level rendering & grasp candidate generation"). We re-interpret grasping as rendering of local surface points and query their features from the shared 3D feature volume. Local points, their features, and the 6DoF grasp pose are passed to a Grasping PointNetwork to predict per-grasp quality (cf. "Local surface rendering & learning grasping functions"). NeuGraspNet effectively learns the interaction between the objects' geometry and the gripper to detect high-fidelity grasps.
Figure 2: Scene-level surface rendering: (a) an input single-view pointcloud; (b) surface rendering on the neural implicit geometry (grey volume) using 6 'virtual' cameras; (c) the reconstructed surface pointcloud; (d) sampled grasp candidates.
Figure 3: Local surface rendering: (a) rendering the neural implicit geometry by ray-marching 3 'virtual' cameras at the three parts of the gripper (gripper used here only for visualization); (b) the neural rendered surface; (c) noisy ground-truth rendered surface used during training for local occupancy supervision (light pink points are unoccupied and dark red points are occupied); (d) ground-truth simulated scene.
Figure 4: Example scene reconstructions & detected grasps for unseen test objects from the VGN (Breyer et al., 2020) (top) and EGAD (Morrison et al., 2020) (bottom) datasets. Our network can sometimes create artifacts or fail to reconstruct very fine details, especially for the hard EGAD objects. Nevertheless, even in these hard cases, it reconstructs the broad structure of the scene & objects, which results in the detection of good grasps (d).
Figure 5: Example failure cases observed in (a) simulated and (b) real-world experiments.
...and 4 more figures
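
The captions for Figures 2 and 3 describe rendering surface points from an implicit occupancy representation by ray-marching 'virtual' cameras. The following is a minimal sketch of that general idea, not the authors' implementation. It makes several assumptions: an analytic sphere (`sphere_occupancy`) stands in for the learned occupancy network, the marching uses fixed steps followed by bisection refinement (one common choice), and all function names, parameters, and default values are hypothetical.

```python
import math


def sphere_occupancy(p, center=(0.0, 0.0, 0.0), radius=0.5):
    """Hypothetical stand-in for a learned occupancy network:
    returns 1.0 inside a sphere and 0.0 outside."""
    return 1.0 if math.dist(p, center) <= radius else 0.0


def march_ray(origin, direction, occupancy, near=0.0, far=2.0,
              steps=64, refine=20, threshold=0.5):
    """Return the first surface point along a ray, or None if nothing is hit.

    The ray is sampled at fixed steps; once a sample is occupied, the
    crossing between the last free sample and it is refined by bisection.
    """
    norm = math.sqrt(sum(c * c for c in direction))
    unit = tuple(c / norm for c in direction)

    def point(t):
        return tuple(o + t * c for o, c in zip(origin, unit))

    step = (far - near) / steps
    t_prev = near
    for i in range(1, steps + 1):
        t = near + i * step
        if occupancy(point(t)) > threshold:
            lo, hi = t_prev, t  # lo is free space, hi is occupied
            for _ in range(refine):
                mid = 0.5 * (lo + hi)
                if occupancy(point(mid)) > threshold:
                    hi = mid
                else:
                    lo = mid
            return point(hi)
        t_prev = t
    return None


def render_surface(origin, occupancy, half_extent=0.6, resolution=5):
    """Cast a grid of rays from one 'virtual' camera at `origin` toward
    targets on the plane z = 0 and collect the surface points that are hit."""
    points = []
    for i in range(resolution):
        for j in range(resolution):
            x = -half_extent + 2.0 * half_extent * i / (resolution - 1)
            y = -half_extent + 2.0 * half_extent * j / (resolution - 1)
            direction = (x - origin[0], y - origin[1], 0.0 - origin[2])
            hit = march_ray(origin, direction, occupancy)
            if hit is not None:
                points.append(hit)
    return points


camera = (0.0, 0.0, -1.5)
surface_points = render_surface(camera, sphere_occupancy)
```

In this toy setup every returned point lies on the sphere's surface, and rays that miss the sphere contribute nothing. Replacing `sphere_occupancy` with a query to a trained occupancy model, and placing several cameras around the scene or around a gripper pose, would give the scene-level and local point sets the captions refer to.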
