Synergies Between Affordance and Geometry: 6-DoF Grasp Detection via Implicit Representations

Zhenyu Jiang, Yifeng Zhu, Maxwell Svetlik, Kuan Fang, Yuke Zhu

TL;DR

This work advances 6-DoF grasp detection in clutter by jointly learning grasp affordance and 3D reconstruction through a shared, differentiable implicit representation. By coupling structured feature grids derived from TSDF fusion with dual implicit heads (one for grasp parameters and one for occupancy), the approach leverages geometry-aware cues while enabling high-resolution, occlusion-aware grasp predictions. Empirical results in simulation and on a real robot show state-of-the-art grasp success and declutter rates, with notable gains under occlusion and partial observations. The method also demonstrates improved 3D reconstruction in graspable regions, highlighting the mutual benefits of multi-task learning on implicit scene representations.

Abstract

Grasp detection in clutter requires the robot to reason about the 3D scene from incomplete and noisy perception. In this work, we draw on the insight that 3D reconstruction and grasp learning are two intimately connected tasks, both of which require a fine-grained understanding of local geometric details. We thus propose to exploit the synergies between grasp affordance and 3D reconstruction through multi-task learning of a shared representation. Our model takes advantage of deep implicit functions, a continuous and memory-efficient representation, to enable differentiable training of both tasks. We train the model on self-supervised grasp trial data in simulation. Evaluation is conducted on a clutter removal task, where the robot clears cluttered objects by grasping them one at a time. The experimental results in simulation and on the real robot demonstrate that implicit neural representations and joint learning of grasp affordance and 3D reconstruction lead to state-of-the-art grasping results. Our method outperforms baselines by over 10% in terms of grasp success rate. Additional results and videos can be found at https://sites.google.com/view/rpl-giga2021

Paper Structure

This paper contains 29 sections, 12 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: We harness the synergies between affordance and geometry for 6-DoF grasp detection in clutter. Our model jointly learns grasp affordance prediction and 3D reconstruction. Supervision from reconstruction helps our model learn geometry-aware features for accurate grasps in occluded regions from partial observations. Supervision from grasping, in turn, produces better 3D reconstruction in graspable regions.
  • Figure 2: Model architecture of GIGA. The input is a TSDF fused from the depth image. After a 3D convolution layer, the output 3D voxel features are projected to canonical planes and aggregated into 2D feature grids. After passing each of the three feature planes through three independent U-Nets, we query the local feature at grasp center/occupancy query point with bilinear interpolation. The affordance implicit functions predict grasp parameters from the local feature at the grasp center. The geometry implicit function predicts occupancy probability from the local feature at the query point.
  • Figure 3: Visualization of packed (left) and pile (right) scenarios. In the packed scenario, objects are placed on the table at their canonical poses. In the pile scenario, objects are dropped on the workspace with random poses. These objects are from Google Scanned Objects [IgnitionFuel-GoogleResearch-Google-Scanned-Objects] and the scenes are rendered with NVISII [Morrical20nvisii].
  • Figure 4: Visualization of the grasp affordance landscape and predicted grasps. Red indicates that the method predicts high grasp affordance near the corresponding area. Green indicates successful grasps and blue indicates failures. The circles highlight interesting examples, such as asymmetric affordance heatmaps and highly occluded objects.
  • Figure 5: Qualitative 3D reconstruction results of a scene rendered from the top view. The circles highlight the contrast.
  • ...and 1 more figure
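
The feature lookup described in the Figure 2 caption (voxel features projected to canonical planes, then queried with bilinear interpolation) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names are hypothetical, the per-plane U-Nets are omitted, and two details the caption does not specify are assumed here, namely mean pooling along the dropped axis as the aggregation and summation as the way the three plane features are combined.

```python
import numpy as np

def project_to_planes(voxel_feat):
    """Project (C, X, Y, Z) voxel features onto the three canonical planes.

    Assumption: features are aggregated by averaging along the dropped axis.
    """
    return {
        "xy": voxel_feat.mean(axis=3),  # (C, X, Y)
        "xz": voxel_feat.mean(axis=2),  # (C, X, Z)
        "yz": voxel_feat.mean(axis=1),  # (C, Y, Z)
    }

def bilinear_sample(plane, u, v):
    """Bilinearly interpolate a (C, H, W) grid at normalized coords u, v in [0, 1].

    0 maps to the first grid cell and 1 to the last along each axis.
    """
    _, h, w = plane.shape
    x = np.clip(u, 0.0, 1.0) * (h - 1)
    y = np.clip(v, 0.0, 1.0) * (w - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, h - 1), min(y0 + 1, w - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * plane[:, x0, y0]
            + wx * (1 - wy) * plane[:, x1, y0]
            + (1 - wx) * wy * plane[:, x0, y1]
            + wx * wy * plane[:, x1, y1])

def query_local_feature(planes, point):
    """Local feature at a 3D point normalized to [0, 1]^3.

    Assumption: the three plane features are combined by summation.
    """
    x, y, z = point
    return (bilinear_sample(planes["xy"], x, y)
            + bilinear_sample(planes["xz"], x, z)
            + bilinear_sample(planes["yz"], y, z))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    voxels = rng.normal(size=(8, 16, 16, 16))  # toy stand-in for 3D conv output
    planes = project_to_planes(voxels)
    grasp_center = (0.4, 0.55, 0.2)            # hypothetical query point
    print(query_local_feature(planes, grasp_center).shape)
```

In the architecture the caption describes, the same lookup would serve both heads: the affordance functions read the feature at the grasp center, and the geometry function reads it at the occupancy query point. Because the query is continuous, predictions are not tied to the voxel resolution of the input TSDF.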