Grasp Diffusion Network: Learning Grasp Generators from Partial Point Clouds with Diffusion Models in SO(3)xR3

Joao Carvalho, An T. Le, Philipp Jahr, Qiao Sun, Julen Urain, Dorothea Koert, Jan Peters

TL;DR

This work addresses grasp generation from partial point clouds by learning a multimodal distribution over $SE(3)$ grasp poses using diffusion on the manifold $SO(3)\times \mathbb{R}^3$. It introduces the Grasp Diffusion Network (GDN), which combines rotation diffusion based on the isotropic Gaussian distribution on $SO(3)$, $\mathcal{IG}_{SO(3)}$, with translation diffusion in $\mathbb{R}^3$, so that the two together form a joint diffusion over $SE(3)$. A key contribution is collision-cost guided sampling, which biases diffusion-based inference toward collision-free grasps using gradient guidance and a pose parameterization; sampling is accelerated with DDIM. Empirical results in simulation and in real-world table-top tasks show that GDN yields higher grasp success rates and more diverse, realistic grasps than the baselines, with inference times fast enough for real-time use. The results advance robust single-view grasping, and the work points to theory and cluttered-scene manipulation as future directions.
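To make the $SO(3)\times \mathbb{R}^3$ idea concrete, the sketch below perturbs a grasp pose by composing its rotation with a small random rotation (through the exponential map) and adding Gaussian noise to its translation. It is only an illustration: the rotation noise here is a Gaussian vector in the tangent space, used as a simple stand-in and not the $\mathcal{IG}_{SO(3)}$ sampler the summary refers to, and the names `so3_exp`, `matmul`, `perturb_pose`, `sigma_rot`, and `sigma_trans` are hypothetical.

```python
import math
import random


def matmul(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def so3_exp(w):
    """Rodrigues' formula: axis-angle vector w in R^3 -> 3x3 rotation matrix."""
    theta = math.sqrt(sum(x * x for x in w))
    if theta < 1e-12:
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = (x / theta for x in w)
    K = [[0.0, -kz, ky], [kz, 0.0, -kx], [-ky, kx, 0.0]]  # skew-symmetric axis
    K2 = matmul(K, K)
    s, c = math.sin(theta), 1.0 - math.cos(theta)
    return [[(1.0 if i == j else 0.0) + s * K[i][j] + c * K2[i][j]
             for j in range(3)] for i in range(3)]


def perturb_pose(R, t, sigma_rot, sigma_trans, rng):
    """Noise a pose (R, t): rotation noise acts by composition, translation noise by addition.

    Assumption: tangent-space Gaussian noise stands in for IG_SO(3) sampling.
    """
    w = [rng.gauss(0.0, sigma_rot) for _ in range(3)]
    R_noisy = matmul(so3_exp(w), R)  # stays on SO(3) by construction
    t_noisy = [ti + rng.gauss(0.0, sigma_trans) for ti in t]
    return R_noisy, t_noisy
```

Because the rotation is updated by matrix composition instead of element-wise addition, the noisy rotation remains a valid rotation matrix, which is the practical reason for treating rotations on the manifold.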

Abstract

Grasping objects successfully from a single-view camera is crucial in many robot manipulation tasks. An approach to solve this problem is to leverage simulation to create large datasets of pairs of objects and grasp poses, and then learn a conditional generative model that can be prompted quickly during deployment. However, the grasp pose data is highly multimodal since there are several ways to grasp an object. Hence, in this work, we learn a grasp generative model with diffusion models to sample candidate grasp poses given a partial point cloud of an object. A novel aspect of our method is to consider diffusion in the manifold space of rotations and to propose a collision-avoidance cost guidance to improve the grasp success rate during inference. To accelerate grasp sampling, we use recent techniques from the diffusion literature to achieve faster inference times. We show in simulation and real-world experiments that our approach can grasp several objects from raw depth images with a $90\%$ success rate, and we benchmark it against several baselines.
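The collision-avoidance cost guidance mentioned above can be pictured as shifting the mean of a denoising step against the gradient of a collision cost. The sketch below is a simplified illustration under stated assumptions and is not the paper's cost: the gripper is approximated by a few spheres, the gripper orientation is held fixed so only the translation is guided, and the cost is a squared hinge penalty on sphere-to-point penetration. The names `collision_cost_grad`, `guided_mean`, `sphere_offsets`, `radius`, and `scale` are hypothetical.

```python
import math


def collision_cost_grad(t, sphere_offsets, radius, points):
    """Gradient w.r.t. translation t of sum 0.5 * max(0, radius - dist)^2.

    Assumption: each gripper sphere is centred at t + offset (fixed orientation),
    and `points` are the observed point-cloud points.
    """
    grad = [0.0, 0.0, 0.0]
    for off in sphere_offsets:
        centre = [t[i] + off[i] for i in range(3)]
        for p in points:
            d = [centre[i] - p[i] for i in range(3)]
            dist = math.sqrt(sum(x * x for x in d))
            penetration = radius - dist
            if penetration > 0.0 and dist > 1e-9:
                # d/dcentre of 0.5 * (radius - dist)^2 = -(radius - dist) * d / dist
                for i in range(3):
                    grad[i] -= penetration * d[i] / dist
    return grad


def guided_mean(mean_t, grad, scale):
    """Shift the predicted translation mean one step down the cost gradient."""
    return [m - scale * g for m, g in zip(mean_t, grad)]
```

With a single sphere 1 cm from a point and a 5 cm radius, the guided mean moves the sphere away from the point; when nothing penetrates, the gradient is zero and the mean is left unchanged.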

Paper Structure

This paper contains 17 sections, 13 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Setup for real-world experiments and overlay execution of a successful grasp. Given an object partial point cloud (blue dots), GDN generates multimodal grasps and executes one successfully.
  • Figure 2: Method overview: the input is a partial point cloud view of the object to grasp (blue dots), and GDN outputs a distribution of gripper poses by denoising in the $SO(3) \times \mathbb{R}^3$ manifold. The denoising network (a conditional ResNet) computes vectors for translation and rotation, ${\bm{\epsilon}}^{{\bm{t}}}_{{\bm{\theta}}}$ and ${\bm{\epsilon}}^{{\bm{R}}}_{{\bm{\theta}}}$, in the Lie algebra. These vectors are used to update the means of the posterior distribution, optionally using collision-avoidance cost guidance with the gradients ${\bm{g}}$. Grasp samples: grasps generated with GDN using the DDPM and DDIM sampling methods. The results align with the simulation results for DDIM on CAT10. DDIM produces successful grasps (in green) but with less variability.
  • Figure 3: Examples of grasp samples generated with models trained on the CAT10 category. The first and fifth columns show more noticeable differences between the methods.
  • Figure 4: Performance of different grasp generator models on CAT10 categories. The diamond in the boxplots shows the mean of the success rate or EMD. The results show that GDN can generate high-diversity grasps (low EMD) and precise ones (high success rate), either at the same level or better than the baselines.
  • Figure 5: GDN with fewer inference steps using DDIM. The diamond in the boxplots shows the mean of the success rate or EMD. The results show that several steps can be skipped during the denoising process with little loss in grasp success rate, but at the cost of grasp variability, as seen by the increase in EMD.
  • ...and 2 more figures
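
The step skipping described in the Figure 5 caption can be illustrated by how a DDIM-style sampler picks which timesteps to visit. The sketch below uses a common evenly strided choice; the exact schedule and step counts used in the paper are not given here, so the function `ddim_timesteps` and the numbers in the usage line are assumptions for illustration only.

```python
def ddim_timesteps(num_train_steps, num_inference_steps):
    """Evenly strided subsequence of training timesteps, ordered from noisy to clean.

    Assumption: a uniform integer stride, one common choice for DDIM-style samplers.
    """
    if not 1 <= num_inference_steps <= num_train_steps:
        raise ValueError("need 1 <= num_inference_steps <= num_train_steps")
    stride = num_train_steps // num_inference_steps
    # Visit every `stride`-th step, then reverse so sampling runs high-to-low noise.
    return [i * stride for i in range(num_inference_steps)][::-1]


# Hypothetical usage: a model trained with 100 steps, sampled with 10.
steps = ddim_timesteps(100, 10)
```

Each visited timestep costs one forward pass of the denoising network, so reducing the number of inference steps reduces sampling time roughly in proportion.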