ZeroGrasp: Zero-Shot Shape Reconstruction Enabled Robotic Grasping
Shun Iwase, Zubair Irshad, Katherine Liu, Vitor Guizilini, Robert Lee, Takuya Ikeda, Ayako Amma, Koichi Nishiwaki, Kris Kitani, Rares Ambrus, Sergey Zakharov
TL;DR
ZeroGrasp tackles robust robotic grasping from a single RGB-D observation by jointly reconstructing 3D geometry and predicting 6D grasp poses in near real time. The method introduces an octree-based conditional variational autoencoder (CVAE) with a multi-object latent transformer and 3D occlusion fields to reason about inter-object relations and occlusions, plus a simple refinement step that uses the reconstruction for contact-based adjustments and collision checks. It is trained on the large-scale synthetic ZeroGrasp-11B dataset and evaluated on GraspNet-1B and ReOcS, achieving state-of-the-art reconstruction and grasping performance and higher grasp success rates in real-robot experiments. The approach improves generalization to unseen objects and demonstrates a practical, scalable pipeline for perception and manipulation in cluttered scenes.
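The collision-check part of the refinement can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the gripper is approximated by a single oriented box and that the reconstruction is available as a point cloud. The names `gripper_collides`, `filter_grasps`, and `half_extents` are hypothetical.

```python
import numpy as np


def gripper_collides(points, rotation, translation, half_extents):
    """Return True if any reconstructed point lies inside the gripper box.

    points:       (N, 3) reconstructed scene points in the world frame.
    rotation:     (3, 3) gripper orientation (gripper axes as columns).
    translation:  (3,) center of the gripper box in the world frame.
    half_extents: (3,) half side lengths of the box in the gripper frame.
    """
    # Express the points in the gripper frame: p_local = R^T (p - t).
    # For row vectors this is (p - t) @ R.
    local = (points - translation) @ rotation
    inside = np.all(np.abs(local) <= half_extents, axis=1)
    return bool(inside.any())


def filter_grasps(points, grasps, half_extents):
    """Keep only the (rotation, translation) grasps whose box is collision-free."""
    return [
        (rotation, translation)
        for rotation, translation in grasps
        if not gripper_collides(points, rotation, translation, half_extents)
    ]
```

Running such a check against reconstructed points rather than only the observed ones is one way a filter like this could reject grasps that would hit occluded geometry.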
Abstract
Robotic grasping is a cornerstone capability of embodied systems. Many methods directly output grasps from partial information without modeling the geometry of the scene, leading to suboptimal motion and even collisions. To address these issues, we introduce ZeroGrasp, a novel framework that simultaneously performs 3D reconstruction and grasp pose prediction in near real time. A key insight of our method is that occlusion reasoning and modeling the spatial relationships between objects are beneficial for both accurate reconstruction and grasping. We couple our method with a novel large-scale synthetic dataset, which comprises 1M photo-realistic images, high-resolution 3D reconstructions, and 11.3B physically valid grasp pose annotations for 12K objects from the Objaverse-LVIS dataset. We evaluate ZeroGrasp on the GraspNet-1B benchmark as well as through real-world robot experiments. ZeroGrasp achieves state-of-the-art performance and generalizes to novel real-world objects by leveraging synthetic data.
