GraspGen: A Diffusion-based Framework for 6-DOF Grasping with On-Generator Training

Adithyavairavan Murali, Balakumar Sundaralingam, Yu-Wei Chao, Wentao Yuan, Jun Yamada, Mark Carlson, Fabio Ramos, Stan Birchfield, Dieter Fox, Clemens Eppner

TL;DR

GraspGen is a diffusion-based, object-centric framework for 6-DOF grasping that generates diverse grasps and then scores them with a discriminator trained "on-generator", that is, on the generator's own samples, which are the data it sees at inference time. By factorizing SE(3) diffusion into $SO(3) \times \mathbb{R}^3$ and using a $T=10$ step DDPM, GraspGen produces high-quality grasps across multiple grippers and clutter levels, validated both in simulation and on a real UR10 robot. The two key contributions, a large-scale dataset of 53M grasps and the On-Generator training recipe, improve generalization and sim-to-real transfer and yield state-of-the-art performance on FetchBench. The work also provides extensive ablations, real-world demonstrations, and practical guidance on architecture choices, normalization, and inference tuning, and it discusses current limitations related to sensing quality and compute demands.

Abstract

Grasping is a fundamental robot skill, yet despite significant research advancements, learning-based 6-DOF grasping approaches are still not turnkey and struggle to generalize across different embodiments and in-the-wild settings. We build upon the recent success in modeling the object-centric grasp generation process as an iterative diffusion process. Our proposed framework, GraspGen, consists of a DiffusionTransformer architecture that enhances grasp generation, paired with an efficient discriminator to score and filter sampled grasps. We introduce a novel and performant on-generator training recipe for the discriminator. To scale GraspGen to both objects and grippers, we release a new simulated dataset consisting of over 53 million grasps. We demonstrate that GraspGen outperforms prior methods in simulations with singulated objects across different grippers, achieves state-of-the-art performance on the FetchBench grasping benchmark, and performs well on a real robot with noisy visual observations.
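
The "score and filter" step mentioned in the abstract can be illustrated with a minimal sketch. This is not the GraspGen implementation: the function name `filter_grasps`, the `threshold` and `top_k` parameters, and the string placeholders standing in for grasp poses are all hypothetical, and the scores are assumed to come from some discriminator that outputs a value in [0, 1] for each sampled grasp.

```python
from typing import Any, List, Optional, Sequence, Tuple


def filter_grasps(
    grasps: Sequence[Any],
    scores: Sequence[float],
    threshold: float = 0.5,
    top_k: Optional[int] = None,
) -> List[Tuple[Any, float]]:
    """Keep grasps scoring at least `threshold`, ordered best-first.

    `grasps` are candidate poses sampled from a generator (hypothetical
    placeholders here); `scores` are per-grasp discriminator outputs.
    """
    if len(grasps) != len(scores):
        raise ValueError("grasps and scores must have the same length")

    # Pair each surviving score with its index so ties keep sampling order.
    kept = [(s, i) for i, s in enumerate(scores) if s >= threshold]
    kept.sort(key=lambda pair: -pair[0])  # stable sort, highest score first

    if top_k is not None:
        kept = kept[:top_k]
    return [(grasps[i], s) for s, i in kept]


if __name__ == "__main__":
    candidates = ["g0", "g1", "g2", "g3"]  # stand-ins for 6-DOF poses
    scores = [0.9, 0.2, 0.7, 0.6]          # assumed discriminator outputs
    print(filter_grasps(candidates, scores, threshold=0.5, top_k=2))
```

In a generate-then-filter design of this kind, the generator can be sampled liberally for diversity, while the threshold and `top_k` decide how many candidates are passed on for execution.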

Paper Structure

This paper contains 28 sections, 1 equation, 15 figures, 6 tables, 1 algorithm.

Figures (15)

  • Figure 2: Architecture for the diffusion noise prediction network.
  • Figure 3: Object-centric evaluation on Franka-ACRONYM [eppner2021acronym].
  • Figure 4: Large-scale evaluation on FetchBench [han2024fetchbench]. GraspGen surpasses all previous methods.
  • Figure 5: Evaluating on complete (left) vs. single-view point clouds (right)
  • Figure 6: Distribution Shift in the On-Generator vs. Offline Datasets (left) and Ablation on Trained Models (right)
  • ...and 10 more figures