FastGrasp: Efficient Grasp Synthesis with Diffusion

Xiaofei Wu, Tao Liu, Caoji Li, Yuexin Ma, Yujiao Shi, Xuming He

TL;DR

A novel diffusion-model-based approach that generates grasping poses in a single stage. It combines a Latent Diffusion Model with an Adaptation Module for object-conditioned hand pose generation, and uses a contact-aware loss to enforce the physical constraints between hands and objects.

Abstract

Effectively modeling the interaction between human hands and objects is challenging due to the complex physical constraints and the requirement for high generation efficiency in applications. Prior methods often employ computationally intensive two-stage pipelines, which first generate an intermediate representation, such as contact maps, and then run an iterative optimization procedure that updates hand meshes to capture the hand-object relation. However, due to the high computational complexity of the optimization stage, such strategies often suffer from slow inference. To address this limitation, this work introduces a novel diffusion-model-based approach that generates the grasping pose in a one-stage manner. This allows us to significantly improve generation speed and the diversity of generated hand poses. In particular, we develop a Latent Diffusion Model with an Adaptation Module for object-conditioned hand pose generation and a contact-aware loss to enforce the physical constraints between hands and objects. Extensive experiments demonstrate that our method achieves faster inference, higher diversity, and better pose quality than state-of-the-art approaches. Code is available at https://github.com/wuxiaofei01/FastGrasp.

Paper Structure

This paper contains 26 sections, 9 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Model training architecture. We divide the training process into two parts. In the first part, we use a latent diffusion model to generate grasping poses from object point clouds. However, the diffusion model struggles to directly learn the physical constraints between the hand and object, leading to issues such as penetration and displacement. To address this, the second part involves training an Adaptation Module to refine the grasping gestures by aligning them with the physical constraints of hand-object interactions, resulting in more natural and feasible poses. In training stage one, only the solid arrow path is utilized. In stage two, both the solid and dotted arrow paths are used.
  • Figure 2: Model inference architecture. We start by inputting Gaussian noise and the object’s point cloud into the model. The diffusion model then generates hand representations in latent space. The Adaptation Module refines these representations, which are then decoded into MANO parameters. Finally, we construct the hand mesh using the MANO layer.
  • Figure 5: User study results. The numbers indicate the percentage of users who rate the corresponding method as more realistic.
  • Figure 6: To assess the impact of a physically constrained loss function, we compare model performance with and without it. Each pair of columns shows generated grasps from two distinct views. The first row uses only the reconstruction loss, while the second row presents results from our proposed pipeline. Our method significantly reduces object penetration compared to using the reconstruction loss alone.
  • Figure 7: To evaluate the necessity of hand vertices as inputs, we visualize the model's output using both hand parameters and hand vertices. Each pair of columns shows generated grasps from two different views. The first row presents results with hand parameter input, while the second row displays results from our pipeline. Our method enhances performance by capturing hand joint details and improving rotational accuracy, which reduces object penetration.
  • ...and 10 more figures
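
The inference flow described in the Figure 2 caption can be sketched as a short dataflow. This is only an illustrative outline, not the authors' implementation: every function below is a hypothetical stand-in for a learned network, and the latent size, step count, and 61-value MANO parameter layout (48 pose, 10 shape, 3 translation) are assumptions chosen for the example. The only grounded parts are the order of the stages and the 778-vertex MANO hand mesh.

```python
import random

LATENT_DIM = 16   # hypothetical latent size
MANO_DIM = 61     # assumption: 48 pose + 10 shape + 3 translation values
NUM_STEPS = 20    # hypothetical number of denoising steps
NUM_VERTS = 778   # vertex count of the MANO hand mesh


def encode_object(points):
    """Stand-in object encoder: returns the centroid of the point cloud."""
    n = len(points)
    return [sum(p[axis] for p in points) / n for axis in range(3)]


def denoise(latent, cond, step):
    """Stand-in for one learned denoising step conditioned on the object."""
    blend = 1.0 / (step + 2)  # later (smaller) steps lean more on the condition
    return [(1 - blend) * z + blend * cond[i % 3] for i, z in enumerate(latent)]


def adapt(latent, cond):
    """Stand-in Adaptation Module: a small object-conditioned correction."""
    return [z + 0.1 * (cond[i % 3] - z) for i, z in enumerate(latent)]


def decode_to_mano(latent):
    """Stand-in decoder: maps the latent to a MANO parameter vector."""
    return [latent[i % len(latent)] for i in range(MANO_DIM)]


def mano_layer(params):
    """Stand-in MANO layer: returns NUM_VERTS placeholder 3D vertices."""
    offset = tuple(params[-3:])  # treat the last three values as a translation
    return [offset for _ in range(NUM_VERTS)]


def generate_grasp(points, seed=0):
    """Noise + object point cloud -> latent -> refined latent -> MANO -> mesh."""
    rng = random.Random(seed)
    cond = encode_object(points)
    latent = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]  # Gaussian noise
    for step in reversed(range(NUM_STEPS)):                    # reverse diffusion
        latent = denoise(latent, cond, step)
    latent = adapt(latent, cond)                               # refinement
    params = decode_to_mano(latent)
    return params, mano_layer(params)
```

Calling `generate_grasp(points)` on a list of `(x, y, z)` tuples returns the parameter vector and the placeholder vertex list. Because there is no iterative mesh optimization after decoding, the cost of one sample is a fixed number of network passes, which is the efficiency argument the abstract makes for the one-stage design.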