Grasp Diffusion Network: Learning Grasp Generators from Partial Point Clouds with Diffusion Models in SO(3)xR3
Joao Carvalho, An T. Le, Philipp Jahr, Qiao Sun, Julen Urain, Dorothea Koert, Jan Peters
TL;DR
This work addresses grasp generation from partial point clouds by learning a multimodal distribution over $SE(3)$ grasp poses using diffusion on the manifold $SO(3)\times \mathbb{R}^3$. It introduces the Grasp Diffusion Network (GDN), which diffuses rotations with the isotropic Gaussian distribution on $SO(3)$, $\mathcal{IG}_{SO(3)}$, and translations with a Gaussian in $\mathbb{R}^3$, and combines the two into a joint diffusion over grasp poses. A key contribution is collision-cost guided sampling, which biases diffusion-based inference toward collision-free grasps using gradient guidance and a pose parameterization; sampling is accelerated with DDIM. In simulation and real-world table-top experiments, GDN achieves higher grasp success rates and more diverse, realistic grasps than the baselines, with inference times fast enough for real-time use. The authors identify theoretical analysis and manipulation in cluttered scenes as future work.
Abstract
Grasping objects successfully from a single-view camera is crucial in many robot manipulation tasks. One approach to this problem is to leverage simulation to create large datasets of object and grasp-pose pairs, and then learn a conditional generative model that can be queried quickly during deployment. However, grasp pose data is highly multimodal, since there are several ways to grasp an object. Hence, in this work, we learn a grasp generative model with diffusion models to sample candidate grasp poses given a partial point cloud of an object. A novel aspect of our method is that we consider diffusion on the manifold of rotations and propose a collision-avoidance cost guidance to improve the grasp success rate during inference. To accelerate grasp sampling, we use recent techniques from the diffusion literature. We show in simulation and real-world experiments that our approach can grasp several objects from raw depth images with a $90\%$ success rate, and we benchmark it against several baselines.
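
To make the two ingredients named above more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that a grasp pose is stored as a rotation matrix `R` and a translation `t`, that rotation noise can be approximated by exponentiating a Gaussian tangent vector (a simple stand-in for sampling from $\mathcal{IG}_{SO(3)}$, not the exact distribution), and that cost guidance acts only on the translation through a user-supplied gradient. The names `perturb_grasp`, `guided_translation_step`, and `cost_grad` are hypothetical.

```python
import numpy as np


def hat(w):
    """Map a 3-vector to its skew-symmetric (cross-product) matrix."""
    x, y, z = w
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])


def exp_so3(w):
    """Rodrigues' formula: axis-angle vector -> rotation matrix in SO(3)."""
    theta = np.linalg.norm(w)
    K = hat(w)
    if theta < 1e-8:
        # Second-order Taylor expansion near the identity.
        return np.eye(3) + K + 0.5 * (K @ K)
    return (np.eye(3)
            + (np.sin(theta) / theta) * K
            + ((1.0 - np.cos(theta)) / theta**2) * (K @ K))


def perturb_grasp(R, t, sigma_rot, sigma_trans, rng):
    """Noise a grasp pose separately on SO(3) and R^3.

    Assumption: a tangent-space Gaussian pushed through the exponential
    map approximates the rotation noise; it is not the exact IG_SO(3).
    """
    R_noise = exp_so3(sigma_rot * rng.standard_normal(3))
    R_noisy = R_noise @ R          # compose on the manifold, never add
    t_noisy = t + sigma_trans * rng.standard_normal(3)
    return R_noisy, t_noisy


def guided_translation_step(t, cost_grad, step_size):
    """One gradient step that lowers a hypothetical collision cost.

    cost_grad(t) must return the gradient of the cost w.r.t. t.
    """
    return t - step_size * cost_grad(t)
```

The point of the sketch is that rotations are perturbed by composition with another rotation, so the noisy sample stays a valid rotation matrix, while translations are perturbed additively as in standard Euclidean diffusion. In a guided sampler, a step such as `guided_translation_step` would be interleaved with the learned denoising updates.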
