HandDiffuse: Generative Controllers for Two-Hand Interactions via Diffusion Models

Pei Lin; Sihang Xu; Hongdi Yang; Yiran Liu; Xin Chen; Jingya Wang; Jingyi Yu; Lan Xu

TL;DR

This work tackles the scarcity of temporally rich, strongly interacting two-hand motion data by introducing HandDiffuse12.5M and a diffusion-based baseline HandDiffuse. The method employs two denoisers (Single Hand Denoiser and Interacting Hands Denoiser) and two motion representations (local and global), augmented by an Interaction Loss to model dynamic hand contact. Empirical results show HandDiffuse surpasses state-of-the-art methods in quality and diversity, and the dataset enables applications such as in-betweening, trajectory-conditioned generation, and data augmentation for other datasets. The dataset and models are released to spur further research in two-hand interaction modeling for AR/VR, robotics, and avatars.

Abstract

Existing hand datasets are largely short-range with weak interactions, owing to the self-occlusion and self-similarity of hands, and cannot yet meet the needs of interacting-hands motion generation. To address this data scarcity, we propose HandDiffuse12.5M, a novel dataset of temporal sequences with strong two-hand interactions. HandDiffuse12.5M has the largest scale and the richest interactions among existing two-hand datasets. We further present HandDiffuse, a strong baseline method for controllable motion generation of interacting hands using various controllers. Specifically, we use a diffusion model as the backbone and design two motion representations for different controllers. To reduce artifacts, we also propose an Interaction Loss that explicitly quantifies the dynamic interaction process. HandDiffuse enables various applications with vivid two-hand interactions, such as motion in-betweening and trajectory control. Experiments show that our method outperforms state-of-the-art techniques in motion generation and can also contribute to data augmentation for other datasets. Our dataset, corresponding code, and pre-trained models will be released to the community for future research on two-hand interaction modeling.

Paper Structure (18 sections, 8 equations, 7 figures, 3 tables)


Introduction
Related Works
Hand Dataset.
Hands Capture & Reconstruction.
Human Motion Generation.
HandDiffuse12.5M Dataset
Capture system
2/3D joint coordinates & MANO fitting
Quantitative Evaluation for our Dataset.
Method
Two Denoisers & Interacting Hands DDIM
Motion Representations for Interacting Hands.
Interaction Loss $\mathbf{Loss}_{interaction}$
Downstream Applications
Experiment
...and 3 more sections

Figures (7)

Figure 1: Overview: The proposed HandDiffuse12.5M benchmark dataset consists of temporal sequences with strong interaction. Based on it, we propose HandDiffuse, a strong baseline for the motion generation of interacting hands.
Figure 2: The capture system and reprojection of MANO. The proposed HandDiffuse12.5M benchmark dataset consists of strong and varied interactions with accurate annotations.
Figure 3: The distribution of HandDiffuse12.5M’s temporal frames.
Figure 4: t-SNE visualization of HandDiffuse12.5M (Ours), InterHand2.6M and Re:InterHand. For InterHand2.6M, we use only its temporal frames. The result indicates the diversity and richness of HandDiffuse12.5M.
Figure 5: (Left) Overview of HandDiffuse. We first generate the motion of each hand separately by training a Single Hand Denoiser, which focuses only on local poses. The generated local poses of the two single hands are concatenated with global information (noise) and transformed into the designed motion representation. The Interacting Hands Denoiser then further optimizes the interaction process. (Right) Overview of Downstream Applications. When the control is key frames, we generate the final motion $\mathbf{\hat{x}_0^{1:N}}$ by giving the first 5 frames $\mathbf{x_{GT}^{:5}}$ and the last 5 frames $\mathbf{x_{GT}^{-5:}}$ before denoising. When the control is the wrists' trajectories, we generate the final motion $\mathbf{\hat{x}_0}$ by giving the root angular velocity and linear velocity in each frame before denoising. Other controls, such as 2D keypoints, are illustrated in the Appendix due to space limitations.
...and 2 more figures
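
The key-frame control described in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the caption only states that the first 5 and last 5 ground-truth frames are given before denoising, so re-imposing them at every step, the function names, the feature dimension, and the `toy_denoise_step` placeholder (a stand-in for the learned Interacting Hands Denoiser) are all assumptions made for illustration.

```python
import numpy as np


def apply_keyframe_control(x, x_gt, n_key=5):
    """Overwrite the first and last n_key frames of x with ground-truth frames.

    x and x_gt are arrays of shape (num_frames, feature_dim).
    """
    if x.shape != x_gt.shape:
        raise ValueError("x and x_gt must have the same shape")
    if 2 * n_key > x.shape[0]:
        raise ValueError("sequence too short for the requested key frames")
    out = x.copy()
    out[:n_key] = x_gt[:n_key]    # first n_key frames, x_GT^{:5} in the caption
    out[-n_key:] = x_gt[-n_key:]  # last n_key frames, x_GT^{-5:} in the caption
    return out


def toy_denoise_step(x, rng):
    """Hypothetical placeholder for one step of a learned denoiser."""
    return 0.9 * x + 0.01 * rng.standard_normal(x.shape)


def inbetween(x_gt, steps=10, n_key=5, seed=0):
    """Fill in the middle frames while keeping the key frames fixed.

    Assumption: key frames are re-imposed before every denoising step.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(x_gt.shape)  # start from Gaussian noise
    for _ in range(steps):
        x = apply_keyframe_control(x, x_gt, n_key)
        x = toy_denoise_step(x, rng)
    # Impose the key frames once more so the output matches them exactly.
    return apply_keyframe_control(x, x_gt, n_key)
```

With a real model, `toy_denoise_step` would be replaced by the trained denoiser and its sampling schedule; the masking logic around it would stay the same.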
