BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics

Wenqian Zhang, Molin Huang, Yuxuan Zhou, Juze Zhang, Jingyi Yu, Jingya Wang, Lan Xu

TL;DR

This work proposes BOTH57M, a novel multi-modal dataset for two-hand motion generation, and provides a strong baseline method, BOTH2Hands, for the novel task: generating vivid two-hand motions from both implicit body dynamics and explicit text prompts.

Abstract

Recent advances in text-to-motion have inspired numerous attempts at convenient and interactive human motion generation. Yet existing methods are largely limited to generating body motions only, without considering the rich two-hand motions, let alone handling various conditions like body dynamics or texts. To break the data bottleneck, we propose BOTH57M, a novel multi-modal dataset for two-hand motion generation. Our dataset includes accurate motion tracking for the human body and hands and provides pairwise finger-level hand annotations and body descriptions. We further provide a strong baseline method, BOTH2Hands, for the novel task: generating vivid two-hand motions from both implicit body dynamics and explicit text prompts. We first warm up two parallel body-to-hand and text-to-hand diffusion models and then utilize a cross-attention transformer for motion blending. Extensive experiments and cross-validations demonstrate the effectiveness of our approach and dataset for generating convincing two-hand motions from the hybrid body-and-textual conditions. Our dataset and code will be disseminated to the community for future research.

Paper Structure

This paper contains 25 sections, 13 equations, 15 figures, and 6 tables.

Figures (15)

  • Figure 1: (a) Our BOTH57M dataset contains rich gestures and body movements. (b) BOTH2Hands is currently the only model that takes both text prompts and body dynamics as input to generate realistic hand motions.
  • Figure 2: The BOTH57M dataset focuses on body-hand motions in daily scenes, incorporating a vast and versatile collection of daily gestures. Each hand segment in the dataset has been manually annotated three separate times. For more details, please refer to the supplementary materials.
  • Figure 3: Overview of BOTH2Hands pipeline. Our pipeline initially feeds text prompts and body movements into two separate diffusion models. Subsequently, the text-conditioned outcomes are projected into the body-conditioned hand coordinate system using forward kinematics. Finally, we utilize the two sequences of hand motions as inputs into a cross-attention transformer for motion blending.
  • Figure 4: Qualitative comparisons. The BOTH2Hands algorithm and other methods [tevet2023human, chen2023executing, zhang2023t2m, li2023ego] are given two conditions: text and motion. Text conditions are listed at the top, and body conditions (shown without hands) are listed on the left side; the temporal order is from top to bottom. The motions that follow the conditions are circled in red.
  • Figure 5: Dataset Evaluation. We train the BOTH2Hands algorithm on the training sets of BOTH57M and Motion-X, then sample on the test sets of the BOTH57M dataset (left) and the Motion-X dataset (right). The poses that follow the conditions are circled in red.
  • ...and 10 more figures
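
The blending step described in the Figure 3 caption, where two hand-motion sequences are fed into a cross-attention transformer, can be illustrated with a minimal single-head cross-attention sketch. This is not the authors' implementation: the function name `cross_attention_blend`, the sequence length, the feature size, the residual connection, and the random untrained weights are all assumptions chosen only to show how one sequence can attend to the other.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_blend(body_hands, text_hands, w_q, w_k, w_v):
    """Hypothetical single-head cross-attention blend.

    body_hands: (T, D) hand motion from the body-conditioned branch (queries).
    text_hands: (T, D) hand motion from the text-conditioned branch (keys/values).
    Returns the blended (T, D) sequence and the (T, T) attention weights.
    """
    q = body_hands @ w_q
    k = text_hands @ w_k
    v = text_hands @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    # Residual connection (an assumption): keep the body-conditioned motion
    # and add the attended text-conditioned features on top.
    return body_hands + attn @ v, attn

# Toy inputs: 8 frames, 6 features per frame (illustrative sizes only).
rng = np.random.default_rng(0)
T, D = 8, 6
body_hands = rng.normal(size=(T, D))
text_hands = rng.normal(size=(T, D))
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(D, D)) for _ in range(3))

blended, attn = cross_attention_blend(body_hands, text_hands, w_q, w_k, w_v)
print(blended.shape, attn.shape)
```

In this sketch the body-conditioned sequence provides the queries, so each output frame stays anchored to the body-driven motion while drawing on text-driven frames; a trained model would learn the projection weights rather than sample them.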