HOI-Diff: Text-Driven Synthesis of 3D Human-Object Interactions using Diffusion Models

Xiaogang Peng, Yiming Xie, Zizhao Wu, Varun Jampani, Deqing Sun, Huaizu Jiang

TL;DR

HOI-Diff tackles text-driven synthesis of 3D human-object interactions (HOIs) by decomposing the task into three modular diffusion-based components: (1) coarse HOI generation via a Dual-Branch Diffusion Model with cross-attention communication between the human and object motion branches, (2) an independent Affordance Prediction Diffusion Model that estimates contact points, and (3) affordance-guided interaction correction that uses classifier guidance to enforce close, plausible contacts. The approach leverages a pre-trained human motion prior, explicit object motion handling, and LLM-assisted object-state reasoning to improve realism and diversity, and it is validated on BEHAVE and OMOMO with annotated text descriptions. Quantitative results show state-of-the-art motion fidelity (FID, R-Precision, Diversity) and physically plausible contacts (Contact Distance, reduced penetration), as well as strong generalization to unseen objects. The work delivers a practical pipeline for text-conditioned 3D HOIs with dynamic objects, with applications in AR/VR, gaming, and film production, and it highlights data- and affordance-related challenges for future work.

Abstract

We address the problem of generating realistic 3D human-object interactions (HOIs) driven by textual prompts. To this end, we take a modular design and decompose the complex task into simpler sub-tasks. We first develop a dual-branch diffusion model (HOI-DM) to generate both human and object motions conditioned on the input text, and encourage coherent motions with a cross-attention communication module between the human and object motion generation branches. We also develop an affordance prediction diffusion model (APDM) to predict the contact area between the human and the object during the interactions described by the textual prompt. The APDM is independent of the results of the HOI-DM and thus can correct potential errors made by the latter. Moreover, it generates the contact points stochastically, which diversifies the generated motions. Finally, we incorporate the estimated contact points into classifier guidance to achieve accurate and close contact between humans and objects. To train and evaluate our approach, we annotate the BEHAVE dataset with text descriptions. Experimental results on BEHAVE and OMOMO demonstrate that our approach produces realistic HOIs with various interactions and different types of objects.
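
The final step of the abstract, steering generated motion toward the estimated contact points through classifier guidance, can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: the squared-distance objective, the names `contact_cost` and `guidance_step`, and the `scale` parameter are all hypothetical, and the paper's actual objective and guidance schedule are not given in this section. The idea shown is only that a differentiable contact cost yields a gradient that pulls human joints toward their paired object contact points.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def contact_cost(joints: List[Vec3], contacts: List[Vec3]) -> float:
    """Sum of squared distances between paired joints and contact points (assumed objective)."""
    return sum(
        sum((j - c) ** 2 for j, c in zip(joint, contact))
        for joint, contact in zip(joints, contacts)
    )


def guidance_step(joints: List[Vec3], contacts: List[Vec3], scale: float = 0.1) -> List[Vec3]:
    """Move each joint one step along the negative gradient of contact_cost."""
    updated = []
    for joint, contact in zip(joints, contacts):
        # Analytic gradient: d/dj ||j - c||^2 = 2 (j - c)
        grad = tuple(2.0 * (j - c) for j, c in zip(joint, contact))
        updated.append(tuple(j - scale * g for j, g in zip(joint, grad)))
    return updated


# Hypothetical data: one hand joint and the contact point predicted for it.
joints = [(0.0, 0.0, 0.0)]
contacts = [(1.0, 0.0, 0.0)]
corrected = guidance_step(joints, contacts, scale=0.1)
```

In a diffusion sampler, a step of this kind would be applied to the model's prediction at each denoising iteration, so the contact constraint shapes the sample gradually instead of being imposed once at the end.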

Paper Structure (29 sections, 13 equations, 18 figures, 6 tables, 1 algorithm)

Figures (18)

  • Figure 1: HOI-Diff can generate realistic motions for 3D human-object interactions given a text prompt and object geometry. Please see the supplementary material for video results. Darker color indicates later frames in the sequence. Best viewed in color.
  • Figure 2: Overview of HOI-Diff for 3D HOI generation using diffusion models. Our key insight is to decompose the generation task into three modules: (a) coarse 3D HOI generation using a dual-branch diffusion model (DBDM), (b) an affordance prediction diffusion model (APDM) that estimates the contact points between humans and objects, and (c) affordance-guided interaction correction, which incorporates the estimated contact information and employs classifier guidance to achieve accurate and close contact between humans and objects, forming coherent HOIs.
  • Figure 3: Illustration of the DBDM architecture for coarse 3D HOI generation. It has two branches that generate human and object motions individually. A mutual cross-attention module allows information exchange between the two branches so that the generated motions are coherent. The human motion model $M^{h}$ is fine-tuned from a pretrained MDM (Tevet et al., 2023).
  • Figure 4: Illustration of the APDM architecture for affordance estimation. The affordance information (human contact labels, object contact positions, and binary object states) is represented jointly as a noise variable, which is fed into the Transformer encoder to produce a clean estimate. The object point cloud and the textual prompt are taken as conditional inputs.
  • Figure 5: Qualitative comparisons of our approach and baselines on the BEHAVE dataset. The bottom row shows our method, which generates realistic 3D HOIs with plausible contacts, particularly evident in columns 2 and 4. The baselines fail to achieve a similar level of realism and contact plausibility. As an additional visual aid, the mesh color gradually darkens over time to represent progression. (Best viewed in color.)
  • ...and 13 more figures
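
The mutual cross-attention described in the Figure 3 caption can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions and not the DBDM architecture itself: it uses a single head, identity projections in place of learned query/key/value weights, and a plain residual add, and the names `cross_attend` and `mutual_cross_attention` are hypothetical. It shows only the exchange pattern, in which each branch queries the other branch's tokens.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: Matrix, context: Matrix) -> Matrix:
    """Scaled dot-product attention: queries from one branch, keys/values from the other.

    Assumption: identity projections stand in for learned weight matrices.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, context)) for i in range(d)])
    return out


def mutual_cross_attention(human: Matrix, obj: Matrix) -> Tuple[Matrix, Matrix]:
    """Each branch attends to the other; results are added back as residuals."""
    human_update = cross_attend(human, obj)
    object_update = cross_attend(obj, human)
    new_human = [[h + u for h, u in zip(row, upd)] for row, upd in zip(human, human_update)]
    new_object = [[o + u for o, u in zip(row, upd)] for row, upd in zip(obj, object_update)]
    return new_human, new_object


# Hypothetical token features: two human-motion tokens and one object-motion token.
human_tokens = [[1.0, 0.0], [0.0, 1.0]]
object_tokens = [[2.0, 2.0]]
new_human, new_object = mutual_cross_attention(human_tokens, object_tokens)
```

Both updates are computed from the original inputs before either residual is applied, so the exchange is symmetric and neither branch sees an already-updated version of the other.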