
End-to-End Dexterous Grasp Learning from Single-View Point Clouds via a Multi-Object Scene Dataset

Tao Geng, Dapeng Yang, Ziwei Liu, Le Zhang, Le Qi, WangYang Li, Yi Ren, Shan Luo, Fenglei Ni

Abstract

Dexterous grasping in multi-object scenes is a fundamental challenge in robotic manipulation. Mainstream grasping datasets focus predominantly on single-object scenarios and predefined grasp configurations, and often neglect environmental interference and the modeling of dexterous pre-grasp gestures, which limits their generalizability to real-world applications. To address this, we propose DGS-Net, an end-to-end grasp prediction network that learns dense grasp configurations from single-view point clouds in multi-object scenes. We further propose a two-stage grasp data generation strategy that progresses from dense single-object grasp synthesis to dense scene-level grasp generation. Our dataset comprises 307 objects, 240 multi-object scenes, and over 350k validated grasps. By explicitly modeling grasp offsets and pre-grasp configurations, the dataset provides more robust and accurate supervision for dexterous grasp learning. Experimental results show that DGS-Net achieves grasp success rates of 88.63% in simulation and 78.98% on a real robotic platform, while exhibiting lower penetration (a mean penetration depth of 0.375 mm and a penetration volume of 559.45 mm$^3$), outperforming existing methods and demonstrating strong effectiveness and generalization capability. Our dataset is available at https://github.com/4taotao8/DGS-Net.

Paper Structure

This paper contains 13 sections, 9 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: (a) Joint configuration. (b) Fingertip points and coordinate systems.
  • Figure 2: Overview of DGS-Net. PTv3 extracts multi-scale point cloud features. Module I (green) predicts grasp reference points and 6D rotations; Module II (yellow) generates grasp offsets and joint configurations. During training (bottom), ground-truth grasp reference point and pose labels guide Module I to speed up convergence and improve stability. During inference (top), Module II uses the predictions of Module I, forming an end-to-end pipeline.
  • Figure 3: (a) Single-object grasp data generation pipeline, generating dense grasps for a single object using fingertip point constraints and admittance control. (b) Circular/rectangular fingertip point constraints. (c) Grasps under different fingertip point constraints. (d) Grasp joint configurations under circular constraints with varying parameters $r$ and $h$.
  • Figure 4: (a) Scene-level grasp data generation pipeline. (b) Grasp labels on the point cloud: blue represents object point clouds; purple denotes the dexterous hand's TCP $T$; red indicates grasp reference points, which are the points in the object point cloud closest to the hand's TCP; green denotes grasp candidate points in the neighborhood of each grasp reference point; gray represents the tabletop point cloud.
  • Figure 5: Real-world execution pipeline: (top) model prediction, (middle) pre-grasp configuration, and (bottom) final grasping execution.
  • ...and 1 more figure
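
The grasp labels described in the Figure 4 caption can be illustrated with a minimal sketch: the grasp reference point is taken as the object point closest to the hand's TCP, and grasp candidate points are the object points in its neighborhood. The function name `label_grasp_points`, the fixed-radius definition of "neighborhood", the `radius` value, and the choice to exclude the reference point from its own candidate set are all assumptions made for illustration; the paper's actual labeling procedure may differ.

```python
import math


def label_grasp_points(object_points, tcp, radius):
    """Return (reference_index, candidate_indices) for one grasp.

    object_points: list of (x, y, z) tuples from the object point cloud.
    tcp: (x, y, z) position of the hand's TCP.
    radius: neighborhood radius (hypothetical parameter, in the same units).
    """
    # Reference point: the object point nearest to the hand's TCP.
    ref_idx = min(
        range(len(object_points)),
        key=lambda i: math.dist(object_points[i], tcp),
    )
    ref = object_points[ref_idx]

    # Candidate points: other object points within `radius` of the reference
    # point (assumed neighborhood definition, not taken from the paper).
    candidates = [
        i
        for i, p in enumerate(object_points)
        if i != ref_idx and math.dist(p, ref) <= radius
    ]
    return ref_idx, candidates


# Toy example with three object points (meters) and a TCP 5 cm above the first.
points = [(0.0, 0.0, 0.0), (0.01, 0.0, 0.0), (0.10, 0.0, 0.0)]
ref_idx, candidates = label_grasp_points(points, tcp=(0.0, 0.0, 0.05), radius=0.02)
```

In this toy case the first point becomes the reference point and only the point 1 cm away falls inside the 2 cm neighborhood. For real point clouds, a spatial index such as a k-d tree would likely replace the linear scans.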