Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction

Junuk Cha; Jihyeon Kim; Jae Shin Yoon; Seungryul Baek

TL;DR

Text2HOI introduces a pioneering framework for text-guided 3D hand-object interaction by decomposing the problem into contact-map prediction and motion generation, followed by a lightweight hand refinement stage. A VAE-based contact predictor produces scale-aware, object-agnostic contact maps conditioned on text, while a Transformer-based diffusion model generates physically plausible hand-object motions guided by these maps and textual prompts. A dedicated refiner further improves contact realism and suppresses penetrations, enabling realistic interactions even with unseen objects. Experiments on H2O, GRAB, and ARCTIC demonstrate superior realism, diversity, and accuracy over baselines, with fast inference and publicly released datasets and code, providing a solid foundation for future research in text-driven 3D interaction generation.

Abstract

This paper introduces the first text-guided work for generating sequences of hand-object interaction in 3D. The main challenge is the lack of labeled data: existing ground-truth datasets are nowhere near generalizable in interaction type and object category, which inhibits the modeling of diverse 3D hand-object interactions with the correct physical implications (e.g., contacts and semantics) from text prompts. To address this challenge, we propose to decompose the interaction generation task into two subtasks: hand-object contact generation and hand-object motion generation. For contact generation, a VAE-based network takes a text prompt and an object mesh as input and generates the probability of contact between the surfaces of the hands and the object during the interaction. The network learns a variety of local geometric structures of diverse objects, independent of the objects' category, and is therefore applicable to general objects. For motion generation, a Transformer-based diffusion model uses this 3D contact map as a strong prior for generating physically plausible hand-object motion as a function of text prompts. It learns from an augmented labeled dataset, for which we annotate text labels on many existing 3D hand and object motion data. Finally, we introduce a hand refiner module that minimizes the distance between the object surface and the hand joints to improve the temporal stability of the object-hand contacts and to suppress penetration artifacts. In the experiments, we demonstrate that our method generates more realistic and diverse interactions than other baseline methods. We also show that our method is applicable to unseen objects. We will release our model and newly labeled data as a strong foundation for future research. Code and data are available at: https://github.com/JunukCha/Text2HOI.
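The two geometric quantities mentioned in the abstract, a contact map on the object and the distance between hand joints and the object surface, can be illustrated with a small sketch. It assumes the object surface is approximated by a sampled point cloud, uses a nearest-point distance, and binarizes contact with a hypothetical threshold. The function names and the threshold value are illustrative assumptions and are not taken from the paper, whose contact map is predicted by a learned network rather than computed this way.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distances between every row of a (M, 3) and b (N, 3) -> (M, N)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def contact_map(obj_points, joints, threshold=0.01):
    """Mark each object point as in contact (1.0) if some hand joint lies
    within `threshold` of it. The threshold is a hypothetical value."""
    nearest = pairwise_distances(obj_points, joints).min(axis=1)
    return (nearest < threshold).astype(np.float32)

def joint_surface_loss(joints, obj_points):
    """Mean distance from each hand joint to its nearest object point.
    A refiner in the spirit of the abstract would try to reduce a term
    like this; the paper's actual loss is not reproduced here."""
    nearest = pairwise_distances(joints, obj_points).min(axis=1)
    return float(nearest.mean())

if __name__ == "__main__":
    joints = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
    obj_points = np.array([[0.0, 0.0, 0.005], [1.0, 0.0, 0.5], [5.0, 5.0, 5.0]])
    print(contact_map(obj_points, joints))
    print(joint_surface_loss(joints, obj_points))
```

In this toy setup only the first object point lies within the threshold of a joint, so it is the only point flagged as in contact. A real system would work on mesh surfaces and full hand models instead of a handful of points.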

Paper Structure (28 sections, 12 equations, 15 figures, 12 tables)

Introduction
Related Work
Method
Contact map prediction
Text-to-3D hand-object motion generation
Preliminaries.
Forward process.
Backward process.
Hand refinement network
Experiments
Implementation details
Dataset
Evaluation metrics and baselines
Experimental results
Ablation study
...and 13 more sections

Figures (15)

Figure 1: Given a text and a canonical object mesh as prompts, we generate 3D motion for hand-object interaction without requiring object trajectory and initial hand pose. We represent the right hand with a light skin color and the left hand with a dark skin color. The articulation of a box in the first row is controlled by estimating an angle for the pre-defined axis of the box.
Figure 2: Schematic diagram of the overall framework. Given a text prompt and a canonical object mesh prompt, our aim is to generate the 3D motion for hand-object interaction. We first generate a contact map from the canonical object mesh, conditioned on the text prompt and the object's scale. The hand-object motion generation module then denoises its noisy inputs so that the denoised outputs align with the predicted contact map and the text prompt. The denoised outputs still exhibit artifacts, including penetration. To address these artifacts, the hand refinement module adjusts the generated (denoised) hand pose parameters to reduce penetration and to improve contact interactions.
Figure 3: The details of the text-to-3D hand-object motion generation in our framework. In the forward process, we generate the noised motion $\{\mathbf{x}^l_t\}^{\hat{L}}_{l=1}$ by adding the noise to the original (ground-truth) motion $\{\mathbf{x}^l_0\}^{\hat{L}}_{l=1}$. In the backward process, the Transformer encoder denoises the noised motion $\{\mathbf{x}^l_t\}^{\hat{L}}_{l=1}$, using various conditions $c$ including text features $f^\text{CLIP}(\mathbf{T})$, contact map $\hat{\mathbf{m}}_\text{contact}$, object features $\mathbf{F}_\text{obj}$, and object's scale $s_\text{obj}$. The right panel illustrates a comparison between conventional positional encoding, which can only differentiate each patch, and our proposed encoding, which provides detailed differentiation of both frames and agents. A unique positional encoding value is assigned for each box, distinguished by different colors.
Figure 4: We compare our generated hand-object motions with the results of other baselines. The rows show the results of Text2Motion [guo2022generating], MDM [tevet2023human], IMOS [ghosh2023imos], and ours.
Figure 5: We demonstrate the generated hand-object motions and the predicted contact map results. The first and second rows show the results with objects seen during training. The third and fourth rows show the results with objects unseen during training.
...and 10 more figures
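The forward process described in the Figure 3 caption, which adds noise to the ground-truth motion $\{\mathbf{x}^l_0\}$ to obtain $\{\mathbf{x}^l_t\}$, can be sketched as follows. This is a minimal illustration that assumes the standard DDPM closed form $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$ with a linear noise schedule. The schedule, the number of steps, and the motion feature dimension are assumptions and are not specified in the text above.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t from x_0 in closed form.

    x0: (L, D) array, L frames of D-dimensional motion features
        (D is a hypothetical feature size).
    t:  integer diffusion step index into alpha_bar.
    Returns the noised motion x_t and the Gaussian noise that was added.
    """
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bar = make_alpha_bar()
    x0 = rng.standard_normal((16, 8))  # 16 frames, 8 features (illustrative)
    x_t, eps = forward_noise(x0, 500, alpha_bar, rng)
    print(x_t.shape)
```

In the backward process, a denoiser would take `x_t`, the step `t`, and the conditions listed in the caption (text features, contact map, object features, object scale) and try to recover the clean motion. That network is not sketched here.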
