GraspLDM: Generative 6-DoF Grasp Synthesis using Latent Diffusion Models

Kuldeep R Barad, Andrej Orsula, Antoine Richard, Jan Dentler, Miguel Olivares-Mendez, Carol Martinez

TL;DR

A modular generative framework for 6-DoF grasp synthesis that uses diffusion models as priors in the latent space of a VAE. This design makes it possible to train task-specific models efficiently by re-training only a small denoising network in the low-dimensional latent space, whereas existing models need expensive re-training.

Abstract

Vision-based grasping of unknown objects in unstructured environments is a key challenge for autonomous robotic manipulation. A practical grasp synthesis system is required to generate a diverse set of 6-DoF grasps from which a task-relevant grasp can be executed. Although generative models are suitable for learning such complex data distributions, existing models suffer from limited grasp quality, long training times, and a lack of flexibility for task-specific generation. In this work, we present GraspLDM, a modular generative framework for 6-DoF grasp synthesis that uses diffusion models as priors in the latent space of a VAE. GraspLDM learns a generative model of object-centric $SE(3)$ grasp poses conditioned on point clouds. The GraspLDM architecture enables us to train task-specific models efficiently by re-training only a small denoising network in the low-dimensional latent space, as opposed to existing models that need expensive re-training. Our framework provides robust and scalable models on both full and partial point clouds. GraspLDM models trained with simulation data transfer well to the real world without any further fine-tuning. Our models achieve an 80% success rate over 80 grasp attempts on diverse test objects across two real-world robotic setups. We make our implementation available at https://github.com/kuldeepbrd1/graspldm.

Paper Structure (17 sections, 7 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: GraspLDM models trained on synthetic data successfully transfer to the real world and provide stable 6-DoF grasps from single-view RGB-D data in the presence of workspace and motion planning constraints.
  • Figure 2: GraspLDM uses a denoising diffusion model in the latent space of a VAE to improve grasp generation performance. It also enables injection of task-conditional guidance in a modular manner.
  • Figure 3: Grasp Latent Diffusion Model (GraspLDM) is composed of a point cloud encoder ($\phi$), a grasp encoder ($\psi$), a grasp decoder ($\xi$), and a latent diffusion module using a score network ($\theta$). The point cloud encoder encodes a point cloud into a shape latent ($\mathbf{z}_{pc}$). At test time, the grasp encoder is not required and we sample the grasp latent $\mathbf{z}_h$ directly from the prior distribution. This latent goes through reverse diffusion before decoding. For task conditional generation, we modify the diffusion score network to accept task context $\mathbf{z}_{task}$.
  • Figure 4: Multi-object grasping environments with Franka gripper in Isaac Gym for success rate evaluation.
  • Figure 5: Grasp generation performance and scaling on full object point clouds ($N=1024$). (a) The mean success rate in simulation of 300 generated grasp poses per object. (b) SE(3) Earth Mover's Distance (EMD) between the ground-truth grasp pose distribution and 100 sampled grasp poses (lower is better). SE3-DiF-1C and SE3-DiF-63C are the SE(3) Grasp Diffusion models from Urain et al. (2022).
  • ...and 6 more figures
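
The test-time pipeline described in the Figure 3 caption (encode the point cloud, sample a grasp latent from the prior, run reverse diffusion, decode) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: every function below is a hypothetical toy stand-in for the learned networks ($\phi$, $\theta$, $\xi$), the latent size and noise schedule are assumed values, and the update rule is the standard DDPM ancestral sampling step, which the source does not specify. The toy noise predictor is an oracle for a single target latent, so the loop converges to that target and the mechanics are easy to check.

```python
import numpy as np

LATENT_DIM = 4  # assumed size of the grasp latent z_h (not from the paper)


def encode_point_cloud(points):
    """Toy stand-in for the point cloud encoder (phi): (N, 3) -> z_pc of size 6."""
    return np.concatenate([points.mean(axis=0), points.std(axis=0)])


def target_latent(z_pc):
    """Toy 'clean' latent that the oracle noise predictor pulls towards."""
    return np.tanh(z_pc[:LATENT_DIM])


def predict_noise(z_t, t, z_pc, alpha_bars):
    """Toy stand-in for the score network (theta).

    Returns the exact noise for a data distribution that is a point mass
    at target_latent(z_pc); a trained network would replace this.
    """
    z0 = target_latent(z_pc)
    return (z_t - np.sqrt(alpha_bars[t]) * z0) / np.sqrt(1.0 - alpha_bars[t])


def reverse_diffusion(z_pc, rng, num_steps=50):
    """Sample z_h from the prior and denoise it with DDPM ancestral steps."""
    betas = np.linspace(1e-4, 0.02, num_steps)  # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    z = rng.standard_normal(LATENT_DIM)  # z_h drawn from the N(0, I) prior
    for t in reversed(range(num_steps)):
        eps = predict_noise(z, t, z_pc, alpha_bars)
        # Posterior mean of the DDPM reverse step.
        z = (z - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:  # no noise is added on the final step
            z = z + np.sqrt(betas[t]) * rng.standard_normal(LATENT_DIM)
    return z


def decode_grasp(z_h, z_pc):
    """Toy stand-in for the grasp decoder (xi): latents -> 4x4 SE(3) pose."""
    # Build a unit quaternion (w, x, y, z) from the latent.
    q = np.array([1.0, 0.0, 0.0, 0.0]) + 0.5 * z_h
    w, x, y, z = q / np.linalg.norm(q)
    rotation = np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])
    pose = np.eye(4)
    pose[:3, :3] = rotation
    # Place the grasp near the object centroid (first three entries of z_pc).
    pose[:3, 3] = z_pc[:3] + 0.1 * z_h[:3]
    return pose


def sample_grasp(points, rng):
    """Full test-time path: encode, denoise a prior sample, decode."""
    z_pc = encode_point_cloud(points)
    z_h = reverse_diffusion(z_pc, rng)
    return decode_grasp(z_h, z_pc)
```

Calling `sample_grasp(points, np.random.default_rng(0))` on an `(N, 3)` array returns one 4x4 homogeneous grasp pose. In this sketch, task-conditional generation would correspond to passing an additional context vector (the caption's $\mathbf{z}_{task}$) into `predict_noise` while leaving the encoder and decoder untouched, which mirrors the modularity the caption describes.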