CenterGrasp: Object-Aware Implicit Representation Learning for Simultaneous Shape Reconstruction and 6-DoF Grasp Estimation

Eugenio Chisari, Nick Heppert, Tim Welschehold, Wolfram Burgard, Abhinav Valada

TL;DR

CenterGrasp tackles the problem of simultaneous 3D shape reconstruction and 6-DoF grasp estimation in clutter by making objects explicit through per-pixel heatmaps and latent codes, and by decoupling object-level geometry and grasps with a shape-and-grasp distance function (SGDF) decoder. The method combines an RGB-D image encoder with a per-object latent code and pose predictor and a per-object SGDF decoder to reconstruct full object shapes and a manifold of grasps, enabling holistic grasping that infers invisible regions. Trained entirely on synthetic data, CenterGrasp achieves strong zero-shot generalization to real scenes and outperforms the state-of-the-art GIGA on reconstruction and grasp metrics in simulation, with substantial improvements in real-robot experiments. The approach also demonstrates competitive performance on GraspNet-1Billion and provides a scalable, object-aware framework for robust scene understanding and grasping in cluttered environments, with code and models released publicly.

Abstract

Reliable object grasping is a crucial capability for autonomous robots. However, many existing grasping approaches focus on general clutter removal without explicitly modeling objects and thus rely only on the visible local geometry. We introduce CenterGrasp, a novel framework that combines object awareness and holistic grasping. CenterGrasp learns a general object prior by encoding shapes and valid grasps in a continuous latent space. It consists of an RGB-D image encoder that leverages recent advances to detect objects and infer their pose and latent code, and a decoder to predict shape and grasps for each object in the scene. We perform extensive experiments on simulated as well as real-world cluttered scenes and demonstrate strong scene reconstruction and 6-DoF grasp-pose estimation performance. Compared to the state of the art, CenterGrasp achieves an improvement of 38.5 mm in shape reconstruction and 33 percentage points on average in grasp success. We make the code and trained models publicly available at http://centergrasp.cs.uni-freiburg.de.

Paper Structure

This paper contains 15 sections, 7 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Overview of CenterGrasp. Panels, in order: input image; object heatmap prediction; subset of all valid and final grasp proposals overlaid on the observed scene; the same subset overlaid on the 3D scene reconstruction; selected grasps overlaid on the observed scene; selected grasps overlaid on the 3D scene reconstruction.
  • Figure 2: Illustration of the CenterGrasp architecture. First, an RGB-D image is fed into the image encoder, which outputs an object heatmap, a pose map, and a latent code map. Next, the object locations in the image are determined by extracting the peaks from the predicted heatmap. At these locations, the pose and latent code of each object are extracted. In the second step, the SGDF decoder infers the shape and grasps for each detected object. Finally, the object pose is used to transform the shape and grasp predictions from the canonical frame to the camera frame.
  • Figure 3: Surface and Grasp Reconstruction. To highlight that CenterGrasp learns a continuous prior over the shape and grasp manifold, we randomly sample latent codes from the learned embedding space and reconstruct the surface as well as ten valid grasps for each object.
  • Figure 4: Generated Synthetic Data. To generate our training data, we render a random scene consisting of a floor, a table, and between 1 and 5 objects to yield an RGB image (left), a depth image with simulated sensor noise (center), and an object heatmap (right).
  • Figure 5: A comparison between the mesh reconstruction of GIGA (Jiang et al., 2021) and the point cloud reconstruction of CenterGrasp. GIGA yields adequate results in the in-distribution evaluation (i.e., using GIGA objects), but its reconstruction quality drastically decreases in the out-of-distribution settings. On the other hand, CenterGrasp demonstrates good reconstruction quality in all environments, including real-world evaluations.
  • ...and 4 more figures
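
The Figure 2 caption describes two geometric steps that are easy to picture in code: picking object locations as peaks of the predicted heatmap, and moving per-object predictions from the canonical frame to the camera frame using the predicted pose. The following is a minimal Python sketch of those two steps only, not the authors' implementation. The 3x3 local-maximum rule, the 0.5 threshold, the 4x4 homogeneous pose layout, and the function names `extract_peaks` and `canonical_to_camera` are all assumptions made for illustration; the encoder and the SGDF decoder are not modeled here.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]
Point = Tuple[float, float, float]


def extract_peaks(heatmap: Matrix, threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return (row, col) of pixels that are at least `threshold` and are the
    maximum of their 3x3 neighbourhood (assumed peak rule; ties count as peaks)."""
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = heatmap[r][c]
            if value < threshold:
                continue
            # Collect the neighbourhood, clipped at the image border.
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
            ]
            if value >= max(neighbours):
                peaks.append((r, c))
    return peaks


def canonical_to_camera(points: Sequence[Point], pose: Matrix) -> List[Point]:
    """Apply a 4x4 homogeneous pose (assumed layout: rotation in the top-left
    3x3 block, translation in the last column) to canonical-frame points."""
    transformed = []
    for x, y, z in points:
        transformed.append(tuple(
            pose[i][0] * x + pose[i][1] * y + pose[i][2] * z + pose[i][3]
            for i in range(3)
        ))
    return transformed


# Toy heatmap with two well-separated blobs (hypothetical values).
heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.2],
    [0.0, 0.0, 0.2, 0.8],
]
peaks = extract_peaks(heatmap)

# Hypothetical object pose: 90 degree rotation about z, then 1 unit along z.
pose = [
    [0.0, -1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
]
camera_points = canonical_to_camera([(1.0, 0.0, 0.0)], pose)
```

In a pipeline of the kind the caption describes, the pose map and latent code map would be read out at each location in `peaks`, and the same transform would be applied to both the reconstructed shape points and the grasp poses of that object.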