Single-View Scene Point Cloud Human Grasp Generation

Yan-Kang Wang, Chengyi Xing, Yi-Lin Wei, Xiao-Ming Wu, Wei-Shi Zheng

TL;DR

This work tackles generating physically plausible human grasps from single-view scene point clouds, a scenario common in real-world perception but challenging due to object incompleteness and scene clutter. The authors introduce S2HGrasp, a two-module framework combining a Global Perception pathway for global object understanding with a DiffuGrasp diffusion-based grasp generator conditioned on scene features. They also release S2HGD, a large synthetic dataset of ~99,000 single-view point clouds for 1,668 objects to support learning and evaluation. Experimental results show end-to-end S2HGrasp outperforms two-stage methods and baseline diffusion models, achieving natural grasps with reduced penetration and good generalization to unseen objects. The work advances practical hand-object interaction modeling in cluttered, real-world viewpoints and provides resources for future research.

Abstract

In this work, we explore a novel task of generating human grasps based on single-view scene point clouds, which more accurately mirrors the typical real-world situation of observing objects from a single viewpoint. Because the object point clouds are incomplete and the input contains numerous scene points, the generated hand is prone to penetrating the invisible parts of the object, and the model is easily distracted by scene points. Thus, we introduce S2HGrasp, a framework composed of two key modules: the Global Perception module, which globally perceives partial object point clouds, and the DiffuGrasp module, which is designed to generate high-quality human grasps from complex inputs that include scene points. Additionally, we introduce the S2HGD dataset, which comprises approximately 99,000 single-object single-view scene point clouds of 1,668 unique objects, each annotated with one human grasp. Our extensive experiments demonstrate that S2HGrasp can not only generate natural human grasps regardless of scene points, but also effectively prevent penetration between the hand and invisible parts of the object. Moreover, our model shows strong generalization capability when applied to unseen objects. Our code and dataset are available at https://github.com/iSEE-Laboratory/S2HGrasp.

Paper Structure (12 sections, 9 equations, 5 figures, 4 tables)

This paper contains 12 sections, 9 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The scene of our dataset and a comparison between our model and GraspCVAE [jiang2021hand]. The top-left image depicts the scene in our dataset, comprising a table and an object. Below it are images from four random viewpoints out of $36$. The three rows of images on the right show, respectively, the single-view images, the generation results of GraspCVAE, and the results of our method. The green box marks the area where hand-object penetration occurs because GraspCVAE lacks global perception ability.
  • Figure 2: S2HGrasp framework. The scene encoder takes single-view scene point clouds as input and extracts their features through PointNet++ [qi2017pointnet++] and a transformer block. The features are then used in point cloud completion (GSP), classification (GCP), and grasp generation (DiffuGrasp). GSP and GCP are not used at test time. During DiffuGrasp training, the model adds noise to the normalized hand parameters, passes the noised parameters through the MANO layer [romero2022embodied], and extracts hand features from the result. The object features and hand features are then fed into a transformer decoder to predict the original hand parameters. At test time, DiffuGrasp sampling starts from random noise and iteratively denoises it to obtain the final hand parameters.
  • Figure 3: Visualization of grasps generated by S2HGrasp on our two datasets. The blue objects on the left show results on View-S2HGD, while the red objects on the right show results on Object-S2HGD.
  • Figure 4: Visualization of point cloud completion and our failure cases. Left: Point cloud completion results, with the input single-view point clouds in the top row and completion results below (gray points for the tabletop, red for the object). Right: Failure cases, with single-view images at the top and corresponding failure cases of our method below.
  • Figure 5: The balance between penetration volume (x-axis) and grasp stability (y-axis). A method whose result point lies closer to the origin is more effective.
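
The DiffuGrasp training and sampling loop described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the linear noise schedule, the step count `NUM_STEPS`, the vector size `HAND_DIM`, and the `denoiser(x_t, t, scene_feat)` callable are all assumptions introduced here, and the MANO layer and transformer decoder are replaced by that single hypothetical callable. What it shows is the general pattern the caption names: noise the hand parameters during training, predict the original parameters, and at test time start from random noise and denoise step by step.

```python
import math
import random

# Assumed values for illustration only; none of them come from the paper.
NUM_STEPS = 100   # number of diffusion steps
HAND_DIM = 61     # length of the normalized hand-parameter vector

# Linear noise schedule (an assumption) and its cumulative products.
BETAS = [1e-4 + (0.02 - 1e-4) * i / (NUM_STEPS - 1) for i in range(NUM_STEPS)]
ALPHAS = [1.0 - b for b in BETAS]
ALPHA_BARS = []
_running = 1.0
for _a in ALPHAS:
    _running *= _a
    ALPHA_BARS.append(_running)
ALPHA_BARS_PREV = [1.0] + ALPHA_BARS[:-1]


def add_noise(x0, t, rng):
    """Training-time forward step: noise clean hand parameters x0 to step t."""
    scale = math.sqrt(ALPHA_BARS[t])
    sigma = math.sqrt(1.0 - ALPHA_BARS[t])
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x0]


def training_loss(denoiser, x0, scene_feat, rng):
    """Loss for one step: predict the original parameters from a noised copy."""
    t = rng.randrange(NUM_STEPS)
    x_t = add_noise(x0, t, rng)
    x0_pred = denoiser(x_t, t, scene_feat)  # hypothetical network stand-in
    return sum((p - v) ** 2 for p, v in zip(x0_pred, x0)) / len(x0)


def sample(denoiser, scene_feat, rng):
    """Test-time sampling: start from random noise and denoise iteratively."""
    x_t = [rng.gauss(0.0, 1.0) for _ in range(HAND_DIM)]
    for t in reversed(range(NUM_STEPS)):
        x0_pred = denoiser(x_t, t, scene_feat)
        denom = 1.0 - ALPHA_BARS[t]
        # Mean coefficients of the Gaussian posterior q(x_{t-1} | x_t, x0_pred).
        coef_x0 = math.sqrt(ALPHA_BARS_PREV[t]) * BETAS[t] / denom
        coef_xt = math.sqrt(ALPHAS[t]) * (1.0 - ALPHA_BARS_PREV[t]) / denom
        # No noise is added at the final step (t == 0).
        sigma = math.sqrt(BETAS[t] * (1.0 - ALPHA_BARS_PREV[t]) / denom) if t > 0 else 0.0
        x_t = [coef_x0 * p + coef_xt * v + sigma * rng.gauss(0.0, 1.0)
               for p, v in zip(x0_pred, x_t)]
    return x_t
```

Predicting the original parameters rather than the added noise follows the caption's wording ("predict the original hand parameters"). The posterior-mean update used in `sample` is the standard DDPM form and is only one possible choice of sampler.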