Volumetric Grasping Network: Real-time 6 DOF Grasp Detection in Clutter

Michel Breyer, Jen Jen Chung, Lionel Ott, Roland Siegwart, Juan Nieto

TL;DR

The Volumetric Grasping Network (VGN) enables real-time 6 DOF grasp detection directly from a TSDF scene representation by predicting a dense per-voxel grasp map consisting of quality, orientation, and width. Trained entirely on synthetic clutter data, VGN outputs grasp proposals for the full workspace in a single forward pass, enabling rapid grasp planning without online sampling or explicit collision checking. Results show ~10 ms inference on a GPU and robust performance in both simulated clutter and real robot experiments, clearing up to 92% of the objects in real-world clutter removal trials. This approach paves the way for closed-loop, disturbance-robust grasping in dynamic, unstructured environments.

Abstract

General robot grasping in clutter requires the ability to synthesize grasps that work for previously unseen objects and that are also robust to physical interactions, such as collisions with other objects in the scene. In this work, we design and train a network that predicts 6 DOF grasps from 3D scene information gathered from an on-board sensor such as a wrist-mounted depth camera. Our proposed Volumetric Grasping Network (VGN) accepts a Truncated Signed Distance Function (TSDF) representation of the scene and directly outputs the predicted grasp quality and the associated gripper orientation and opening width for each voxel in the queried 3D volume. We show that our approach can plan grasps in only 10 ms and is able to clear 92% of the objects in real-world clutter removal experiments without the need for explicit collision checking. The real-time capability opens up the possibility for closed-loop grasp planning, allowing robots to handle disturbances, recover from errors and provide increased robustness. Code is available at https://github.com/ethz-asl/vgn.

Paper Structure

This paper contains 12 sections, 3 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: A scan of the 3D scene is converted into the TSDF that is passed into VGN. VGN has a three-headed output, providing the grasp quality, gripper orientation and gripper width at each voxel. Non-maxima suppression is then applied using the grasp quality output and invalid grasps are filtered out given the input TSDF.
  • Figure 2: Examples of "pile" (a) and "packed" (b) scenes. Subfigure (c) shows the definition of the grasp frame origin with respect to the gripper geometry and (d) shows the distribution of angles between the gravity vector and the $z$ axis of grasps from the training set.
  • Figure 3: The 12 test objects used in our robot grasping experiments.
  • Figure 4: Examples of real world grasps detected by VGN (a)-(b). (c) shows a typical failure case for our model where the fingers slip off the cylinder-shaped object due to a small contact surface. The system is also capable of side-grasps (d) and picking the thin rim of bowls (e).
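
The post-processing described in the Figure 1 caption (non-maxima suppression on the quality output, then filtering against the input TSDF) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names `local_maxima_3d` and `select_grasps`, the thresholds `q_min` and `tsdf_min`, the 3x3x3 neighbourhood, the quaternion layout of `rotation`, and the validity rule (keep voxels whose TSDF value exceeds `tsdf_min`) are all assumptions made for the example.

```python
import numpy as np


def local_maxima_3d(vol):
    """Boolean mask of voxels that are >= each of their 26 neighbours."""
    nx, ny, nz = vol.shape
    # Pad with -inf so border voxels only compete with real neighbours.
    padded = np.pad(vol.astype(float), 1, mode="constant",
                    constant_values=-np.inf)
    is_max = np.ones(vol.shape, dtype=bool)
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                if (dx, dy, dz) == (0, 0, 0):
                    continue
                shifted = padded[1 + dx:1 + dx + nx,
                                 1 + dy:1 + dy + ny,
                                 1 + dz:1 + dz + nz]
                is_max &= vol >= shifted
    return is_max


def select_grasps(quality, rotation, width, tsdf, q_min=0.9, tsdf_min=0.5):
    """Pick grasp candidates from dense per-voxel outputs.

    quality:  (X, Y, Z) predicted grasp quality per voxel
    rotation: (X, Y, Z, 4) predicted orientation per voxel (assumed quaternion)
    width:    (X, Y, Z) predicted gripper opening width per voxel
    tsdf:     (X, Y, Z) input TSDF grid
    q_min and tsdf_min are hypothetical thresholds.
    """
    # Non-maxima suppression on quality, then quality and TSDF filtering.
    keep = local_maxima_3d(quality) & (quality >= q_min) & (tsdf > tsdf_min)
    grasps = []
    for idx in np.argwhere(keep):
        i = tuple(int(v) for v in idx)
        grasps.append({
            "voxel": i,
            "quality": float(quality[i]),
            "rotation": rotation[i],
            "width": float(width[i]),
        })
    # Best candidates first.
    return sorted(grasps, key=lambda g: g["quality"], reverse=True)


if __name__ == "__main__":
    # Toy random volumes standing in for network outputs.
    rng = np.random.default_rng(0)
    shape = (8, 8, 8)
    quality = rng.random(shape)
    rotation = rng.normal(size=shape + (4,))
    width = rng.uniform(0.0, 0.08, size=shape)
    tsdf = rng.random(shape)
    print(len(select_grasps(quality, rotation, width, tsdf)), "candidates")
```

Because every voxel already carries a quality, orientation and width, candidate selection reduces to array masking and indexing, which is one reason a dense output of this kind needs no online grasp sampling.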