Contact-GraspNet: Efficient 6-DoF Grasp Generation in Cluttered Scenes

Martin Sundermeyer, Arsalan Mousavian, Rudolph Triebel, Dieter Fox

TL;DR

This work tackles the challenge of 6-DoF grasp generation for unknown objects in cluttered scenes by introducing Contact-GraspNet, an end-to-end network that predicts grasps directly from depth data. It employs a novel contact-point grasp representation that anchors the grasp pose to observed surface points, reducing the learnable space from $SE(3)$ to 4-DoF and enabling efficient, diverse, collision-aware grasp generation. Trained on $17.7$ million simulated grasps from the ACRONYM dataset using a PointNet++-based architecture, the method delivers fast inference (0.28 s per scene) and achieves over $90\%$ grasp success in real-robot experiments, outperforming prior state-of-the-art methods. The approach is robust to imperfect segmentation, supports local ROI processing, and facilitates reactive closed-loop grasping in cluttered environments, representing a significant step toward reliable autonomous manipulation in unstructured settings.

Abstract

Grasping unseen objects in unconstrained, cluttered environments is an essential skill for autonomous robotic manipulation. Despite recent progress in full 6-DoF grasp learning, existing approaches often consist of complex sequential pipelines that possess several potential failure points and run-times unsuitable for closed-loop grasping. Therefore, we propose an end-to-end network that efficiently generates a distribution of 6-DoF parallel-jaw grasps directly from a depth recording of a scene. Our novel grasp representation treats 3D points of the recorded point cloud as potential grasp contacts. By rooting the full 6-DoF grasp pose and width in the observed point cloud, we can reduce the dimensionality of our grasp representation to 4-DoF which greatly facilitates the learning process. Our class-agnostic approach is trained on 17 million simulated grasps and generalizes well to real world sensor data. In a robotic grasping study of unseen objects in structured clutter we achieve over 90% success rate, cutting the failure rate in half compared to a recent state-of-the-art method.
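To make the 4-DoF idea concrete, the following is a minimal sketch of how a full 6-DoF gripper pose could be assembled from an observed contact point plus the predicted quantities. The symbols follow the Figure 3 caption ($c$, $\mathbf{a}$, $\mathbf{b}$, $w$, $d$). The function name, the column order of the rotation matrix, and the sign convention (the approach vector is assumed to point from the gripper base toward the object, so the base sits a distance $d$ behind the baseline) are assumptions made for illustration and may differ from the paper's own equation.

```python
import numpy as np

def grasp_pose_from_contact(c, a, b, w, d):
    """Assemble a 4x4 gripper pose from a contact point and a 4-DoF prediction.

    c: observed contact point, shape (3,)
    a: approach direction (assumed to point from gripper base toward the object)
    b: baseline direction (from this contact toward the opposite finger)
    w: grasp width, d: distance from the baseline to the gripper base frame
    """
    c = np.asarray(c, dtype=float)
    b = np.asarray(b, dtype=float)
    b = b / np.linalg.norm(b)
    a = np.asarray(a, dtype=float)
    # Gram-Schmidt step: force the approach vector to be orthogonal to the baseline.
    a = a - np.dot(a, b) * b
    a = a / np.linalg.norm(a)

    pose = np.eye(4)
    pose[:3, 0] = b               # x-axis: baseline direction
    pose[:3, 1] = np.cross(a, b)  # y-axis: completes a right-handed frame
    pose[:3, 2] = a               # z-axis: approach direction
    # Move from the contact to the middle of the baseline, then back by d along a.
    pose[:3, 3] = c + 0.5 * w * b - d * a
    return pose
```

Because the translation is tied to an observed point, the network only has to predict the rotation (3 DoF, via $\mathbf{a}$ and $\mathbf{b}$) and the width $w$ (1 DoF), which is where the 4-DoF count comes from.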

Paper Structure

This paper contains 14 sections, 8 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Contact-GraspNet efficiently predicts diverse and stable grasps in cluttered scenes while avoiding collisions.
  • Figure 2: Training Data Pipeline. We place object meshes with dense grasp annotations from the ACRONYM dataset [eppner2021icra] at random stable poses in scenes. Grasp poses that produce gripper model collisions are removed. Resulting grasps are mapped to their contacts on the mesh surface. During training, we sample virtual cameras to render point clouds from the scenes. We consider recorded points (yellow) as positive contacts if there exists a mesh contact (blue) within a 5 mm radius and associate the grasp transformation belonging to the closest mesh contact to them. These per-point annotations are used to supervise the Contact Grasp Network.
  • Figure 3: Our grasp representation: $c$ depicts an observed contact point. $\mathbf{a}$ and $\mathbf{b}$ constitute the 3-DoF rotation, $w$ is the predicted grasp width, $d$ the distance from baseline to base frame. In pink we show the five gripper points $\mathbf{v}$ that we used in the $l_{add-s}$ loss.
  • Figure 4: Full Inference Pipeline: We segment unknown objects from an RGB-D image using [Xiang2020LearningRF]. Our Contact-GraspNet processes the full scene point cloud or a local region of interest around a target object. Predicted 6-DoF grasps are then associated to object segments by filtering their contact points. On the right we show the predicted 6-DoF grasp distribution and, in bold, the most confident grasp per segment.
  • Figure 5: Loss Ablations: Without weighted binning in the grasp width loss $l_{width}$, both success rate and coverage decrease. The $l_{add-s}$ loss leads to increased success rates at high-confidence contacts (Coverage $\in [0,0.1]$) and to a slightly decreased success rate in the low-confidence regime. This confidence calibration is important, since it determines which grasp is eventually executed.
  • ...and 2 more figures
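
The per-point labeling described in the Figure 2 caption can be sketched as a nearest-neighbor lookup. This is an illustrative brute-force version, not the authors' implementation: the function name `label_contacts`, the array layout, and the use of meters as the unit are assumptions. A rendered point counts as a positive contact when some mesh contact lies within the 5 mm radius, and it inherits the grasp of the closest mesh contact.

```python
import numpy as np

def label_contacts(points, mesh_contacts, radius=0.005):
    """Label rendered points as positive or negative grasp contacts.

    points:        (N, 3) rendered point cloud, in meters (assumed unit)
    mesh_contacts: (M, 3) ground-truth grasp contacts on the mesh surface
    radius:        association radius, 5 mm by default

    Returns a boolean mask of shape (N,) and, per point, the index of the
    associated mesh contact (and hence its grasp), or -1 for negatives.
    """
    points = np.asarray(points, dtype=float)
    mesh_contacts = np.asarray(mesh_contacts, dtype=float)
    # Pairwise distances between every rendered point and every mesh contact.
    diff = points[:, None, :] - mesh_contacts[None, :, :]
    dist = np.linalg.norm(diff, axis=2)  # shape (N, M)
    nearest = dist.argmin(axis=1)
    nearest_dist = dist[np.arange(len(points)), nearest]
    positive = nearest_dist <= radius
    grasp_index = np.where(positive, nearest, -1)
    return positive, grasp_index
```

The dense distance matrix uses O(N·M) memory, so for large scenes a spatial index such as a k-d tree would be the more practical choice.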