
DexGraspNet 2.0: Learning Generative Dexterous Grasping in Large-scale Synthetic Cluttered Scenes

Jialiang Zhang, Haoran Liu, Danshi Li, Xinqiang Yu, Haoran Geng, Yufei Ding, Jiayi Chen, He Wang

TL;DR

This work presents a large-scale synthetic benchmark, encompassing 1319 objects, 8270 scenes, and 427 million grasps, and proposes a novel two-stage grasping method that learns efficiently from data by using a diffusion model that conditions on local geometry.

Abstract

Grasping in cluttered scenes remains highly challenging for dexterous hands due to the scarcity of data. To address this problem, we present a large-scale synthetic benchmark, encompassing 1319 objects, 8270 scenes, and 427 million grasps. Beyond benchmarking, we also propose a novel two-stage grasping method that learns efficiently from data by using a diffusion model that conditions on local geometry. Our proposed generative method outperforms all baselines in simulation experiments. Furthermore, with the aid of test-time depth restoration, our method demonstrates zero-shot sim-to-real transfer, attaining a 90.7% real-world dexterous grasping success rate in cluttered scenes.

Paper Structure

This paper contains 49 sections, 7 equations, 15 figures, and 11 tables.

Figures (15)

  • Figure 1: Overview. Simulation Dataset: DexGraspNet 2.0 contains 427M grasps (for clarity, only 4 random grasps are visualized in each scene). Real-world Execution: the first row shows model-generated grasps conditioned on real-world single-view depth point clouds, the second row shows the top-ranked grasps, and the third row shows the real-world executions.
  • Figure 2: The DexGraspNet 2.0 benchmark includes 7600 training scenes with an average of 50k+ grasps per scene, totaling 400M+ grasp labels. Images at the same position in the first and second rows correspond to the same scene. Each colored point in the second row represents the palm position of a grasp label, with different colors indicating grasp poses on different objects. For simplicity, the grasp labels in each scene are downsampled to 1000.
  • Figure 3: Method architecture. Our method leverages a generative model conditioned on local features and models the distribution of grasp poses $(T, R, \theta)$ in a decomposed way. Inference: The model receives a single-view depth point cloud and generates multiple grasps (only one is visualized). Training: The model takes the depth point cloud and ground-truth annotations to learn the data distribution.
  • Figure 4: Scaling the number of scenes/grasps.
  • Figure 5: Real-world experiment objects.
  • ...and 10 more figures