ZeroGrasp: Zero-Shot Shape Reconstruction Enabled Robotic Grasping

Shun Iwase, Zubair Irshad, Katherine Liu, Vitor Guizilini, Robert Lee, Takuya Ikeda, Ayako Amma, Koichi Nishiwaki, Kris Kitani, Rares Ambrus, Sergey Zakharov

TL;DR

ZeroGrasp tackles robust robotic grasping from a single RGB-D observation by jointly reconstructing 3D geometry and predicting 6D grasps in near real-time. The method introduces an octree-based CVAE with a multi-object latent transformer and 3D occlusion fields to reason about inter-object relations and occlusions, plus a simple refinement step that uses the reconstruction for contact-based adjustments and collision checks. It is trained on the large-scale synthetic ZeroGrasp-11B dataset and evaluated on GraspNet-1B and ReOcS, achieving state-of-the-art reconstruction and grasping performance as well as higher success rates on a real robot. The approach improves generalization to unseen objects and demonstrates a practical, scalable pipeline for perception and manipulation in cluttered scenes.

Abstract

Robotic grasping is a cornerstone capability of embodied systems. Many methods directly output grasps from partial information without modeling the geometry of the scene, leading to suboptimal motion and even collisions. To address these issues, we introduce ZeroGrasp, a novel framework that simultaneously performs 3D reconstruction and grasp pose prediction in near real-time. A key insight of our method is that occlusion reasoning and modeling the spatial relationships between objects are beneficial for both accurate reconstruction and grasping. We couple our method with a novel large-scale synthetic dataset, which comprises 1M photo-realistic images, high-resolution 3D reconstructions, and 11.3B physically valid grasp pose annotations for 12K objects from the Objaverse-LVIS dataset. We evaluate ZeroGrasp on the GraspNet-1B benchmark as well as through real-world robot experiments. ZeroGrasp achieves state-of-the-art performance and generalizes to novel real-world objects by leveraging synthetic data.

Paper Structure

This paper contains 42 sections, 15 equations, 20 figures, 5 tables.

Figures (20)

  • Figure 1: ZeroGrasp simultaneously reconstructs objects at high resolution and predicts grasp poses from a single RGB-D image in near real-time ($5$ FPS).
  • Figure 2: Overview of ZeroGrasp, a novel method for simultaneous 3D reconstruction and 6D grasp pose prediction from a single-view RGB-D image. The input octree $\mathbf{x}$ is first fed into the octree-based CVAE (components in orange boxes). The multi-object encoder takes its latent feature $\ell$ and performs multi-object reasoning in the latent space. In addition, 3D occlusion fields encode inter- and self-occlusion information via simple ray casting. The output features from the multi-object encoder and the 3D occlusion fields are concatenated with the latent code $\mathbf{z}$, and the decoder predicts 3D shapes and grasp poses.
  • Figure 3: 3D occlusion fields localize occlusion information by casting rays from the camera to the voxel centers around the target object and performing depth tests. Specifically, if a ray intersects the target object, a self-occlusion flag $o_{\text{self}}$ is set to 1. If it intersects non-target objects, an inter-object occlusion flag $o_{\text{inter}}$ is set to 1.
  • Figure 4: Example RGB images, stereo depth maps, 3D shapes and grasp poses from the ReOcS and ZeroGrasp-11B datasets. The grasp poses of the ZeroGrasp-11B dataset are subsampled by grasp-NMS (Fang et al., 2020) for better visibility of the 3D shapes and grasps. More examples can be found in the supplementary material.
  • Figure 5: Contact-based constraints are used to effectively refine grasp poses. We first obtain the contact points $c_L$ and $c_R$. Next, the contact distance $D\left(c_{L|R}\right)$ and the depth $Z\left(c_{L|R}\right)$ are computed. Finally, the width and depth of the grasp are refined using the width and depth update equations.
  • ...and 15 more figures
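
The depth test described in the Figure 3 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes the rays are evaluated against the observed depth map and an instance mask, so each ray is tested only against the first visible surface. The function name `occlusion_flags`, its parameters, and the tolerance `eps` are hypothetical choices made for illustration.

```python
import numpy as np

def occlusion_flags(voxel_centers, depth, instance_mask, K, target_id, eps=1e-3):
    """Per-voxel self- and inter-object occlusion flags (illustrative sketch).

    voxel_centers: (N, 3) points in the camera frame, in meters.
    depth:         (H, W) observed depth map in meters (0 = missing).
    instance_mask: (H, W) integer object ids (0 = background).
    K:             (3, 3) pinhole camera intrinsics.
    target_id:     id of the target object in instance_mask.
    """
    H, W = depth.shape
    x, y, z = voxel_centers[:, 0], voxel_centers[:, 1], voxel_centers[:, 2]
    in_front = z > 0
    z_safe = np.where(in_front, z, 1.0)  # avoid dividing by zero or negative depth

    # Project each voxel center to the pixel that its camera ray passes through.
    u = np.round(K[0, 0] * x / z_safe + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * y / z_safe + K[1, 2]).astype(int)
    in_view = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    o_self = np.zeros(len(z), dtype=np.uint8)
    o_inter = np.zeros(len(z), dtype=np.uint8)

    ui, vi = u[in_view], v[in_view]
    surface = depth[vi, ui]      # depth of the first surface hit along the ray
    ids = instance_mask[vi, ui]  # object that owns that surface

    # Depth test: the voxel is occluded if the ray hits a surface before reaching it.
    blocked = (surface > 0) & (z[in_view] > surface + eps)
    o_self[in_view] = blocked & (ids == target_id)
    o_inter[in_view] = blocked & (ids != target_id) & (ids != 0)
    return o_self, o_inter
```

Under these assumptions, a voxel behind the target's visible surface gets $o_{\text{self}} = 1$, a voxel behind another object's surface gets $o_{\text{inter}} = 1$, and voxels in front of every observed surface or behind background pixels keep both flags at 0.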