Diffgrasp: Whole-Body Grasping Synthesis Guided by Object Motion Using a Diffusion Model

Yonghao Zhang; Qiang He; Yanguang Wan; Yinda Zhang; Xiaoming Deng; Cuixia Ma; Hongan Wang

TL;DR

DiffGrasp addresses the challenge of generating realistic whole-body human motion with fine finger grasping conditioned on 3D object motion. It introduces a single diffusion model with a transformer-based condition encoder to jointly model body, hands, and object dynamics, augmented by two contact-aware losses and a data-driven guidance strategy to stabilize grasping and prevent penetration. Experimental results on GRAB and ARCTIC show state-of-the-art performance across hand contact, collision, and motion-quality metrics, including generalization to unseen objects, while ablations confirm the effectiveness of the proposed losses and guidance. This work enables more believable bi-manual grasp synthesis for applications in animation, VR/AR, and robotics.

Abstract

Generating high-quality whole-body human-object interaction motion sequences is becoming increasingly important in fields such as animation, VR/AR, and robotics. The main challenge lies in determining the level of involvement of each hand, given objects of complex shapes and varying sizes with different motion trajectories, while ensuring strong grasping realism and coordinated movement across all body parts. In contrast to existing work, which either generates human interaction motion sequences without detailed hand grasping poses or models only a static grasping pose, we propose a simple yet effective framework that jointly models the relationship between the body, the hands, and the given object motion sequence within a single diffusion model. To guide our network in perceiving the object's spatial position and learning more natural grasping poses, we introduce novel contact-aware losses and incorporate a carefully designed, data-driven guidance strategy. Experimental results demonstrate that our approach outperforms the state-of-the-art method and generates plausible whole-body motion sequences.

Paper Structure (54 sections, 15 equations, 13 figures, 3 tables)

Introduction
Related Work
Human Motion Generation.
Hand Grasp Generation.
Human Object Interaction Generation.
Method
Data Representation
Human Motion Representation.
Object Sequence Representation.
Condition Input.
Conditional Diffusion Model
Model Architecture.
Denoiser Outputs.
Conditional Diffusion Loss.
Contact Label.
...and 39 more sections

Figures (13)

Figure 1: DiffGrasp generates a whole-body human grasp sequence with realistic finger-object contact, conditioned on the 3D object shape and the object motion sequence.
Figure 2: Overview of the DiffGrasp framework. Our conditional diffusion model takes the given object motion sequence, the object shape, and the SMPL-X identity as conditions. After specially designed positional encodings, the embedded conditions are fed into a transformer-encoder-based condition encoder. A transformer decoder, acting as the denoising network, then predicts a sequence of clean whole-body SMPL-X poses together with the wrist joint translations relative to the object centroid. At inference, we reconstruct a human mesh sequence from the SMPL-X pose sequence. Using carefully designed guidance functions applied through a reconstruction guidance strategy, we steer the predicted results toward more stable hand grasping ($\mathcal{G}_{GS}$), less penetration ($\mathcal{G}_{HO}$), and better foot-floor contact ($\mathcal{G}_{Feet}$).
Figure 3: Illustration of Grasp Stabilization Guidance $\mathcal{G}_{GS}$, using a 'handshaking' object movement as an example. Initially, the generated hand-object relative distance $\kappa$ and the reconstructed hand do not follow the shaking of the object (yellow hand) well. We stabilize the hand-object relative distance according to Eq. \ref{eqa:10} to obtain a wrist position that follows the object's shaking, and then guide the reconstructed wrist position accordingly to achieve the handshaking effect.
Figure 4: Qualitative Results of Comparison Experiments. Our model (DiffGrasp) generates more realistic results, with more hand-object contact and less penetration.
Figure 5: Qualitative Results of Ablation Study. In this figure, Full is the abbreviation for Full loss, R. is the abbreviation for Recon, and I. is the abbreviation for Inter.
...and 8 more figures
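
The Figure 2 and Figure 3 captions describe guidance functions that steer the predicted clean motion at inference time. The toy sketch below illustrates only the general idea, under strong assumptions: the actual definitions of $\mathcal{G}_{GS}$, $\mathcal{G}_{HO}$, and $\mathcal{G}_{Feet}$ are not given in this listing, so `grasp_cost` is a hypothetical quadratic stand-in, and the gradient step is applied directly to a predicted wrist trajectory rather than inside a full diffusion sampling loop on reconstructed SMPL-X meshes. All function names, the offset, and the step size are illustrative choices, not the authors' implementation.

```python
import numpy as np

def grasp_cost(wrist, centroid, offset):
    """Hypothetical guidance cost (NOT the paper's G_GS): squared distance
    between each wrist position and the object centroid plus a fixed
    desired wrist-to-object offset, summed over all frames."""
    diff = wrist - (centroid + offset)
    return float(np.sum(diff ** 2))

def grasp_cost_grad(wrist, centroid, offset):
    """Analytic gradient of grasp_cost with respect to the wrist positions."""
    return 2.0 * (wrist - (centroid + offset))

def guide_clean_prediction(wrist_pred, centroid, offset, step_size=0.25, n_steps=1):
    """Nudge a predicted clean wrist trajectory down the cost gradient.
    A simplification: real reconstruction guidance acts within the
    denoising loop, which is omitted here."""
    wrist = wrist_pred.copy()
    for _ in range(n_steps):
        wrist = wrist - step_size * grasp_cost_grad(wrist, centroid, offset)
    return wrist

# Toy data: an object centroid moving along x over three frames (T, 3).
centroid = np.array([[0.0, 0.0, 1.0],
                     [0.1, 0.0, 1.0],
                     [0.2, 0.0, 1.0]])
offset = np.array([0.0, 0.05, 0.0])        # assumed desired wrist offset
wrist_pred = centroid + offset + 0.1        # prediction that drifts off target

cost_before = grasp_cost(wrist_pred, centroid, offset)
wrist_guided = guide_clean_prediction(wrist_pred, centroid, offset)
cost_after = grasp_cost(wrist_guided, centroid, offset)
print(cost_before, cost_after)
```

With this quadratic cost and a step size of 0.25, one step halves the wrist's distance to the target in every frame, so the cost drops to a quarter of its initial value. A real guidance term would presumably need mesh-level quantities such as contact and penetration, which this sketch does not model.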
