HOI-Diff: Text-Driven Synthesis of 3D Human-Object Interactions using Diffusion Models
Xiaogang Peng, Yiming Xie, Zizhao Wu, Varun Jampani, Deqing Sun, Huaizu Jiang
TL;DR
HOI-Diff tackles text-driven synthesis of 3D human-object interactions (HOIs) by decomposing the task into three modular diffusion-based components: (1) coarse HOI generation via a Dual-Branch Diffusion Model with cross-attention communication between the human and object motion branches, (2) an independent Affordance Prediction Diffusion Model that estimates contact points, and (3) affordance-guided interaction correction that uses classifier guidance to enforce close, plausible contacts. The approach leverages a pre-trained human motion prior, explicit object motion handling, and LLM-assisted object-state reasoning to improve realism and diversity, and is validated on BEHAVE and OMOMO with annotated text descriptions. Quantitative results show state-of-the-art motion fidelity (FID, R-Precision, Diversity) and physically plausible contacts (Contact Distance, reduced penetration), as well as strong generalization to unseen objects. The work delivers a practical pipeline for text-conditioned 3D HOIs with dynamic objects, enabling broader AR/VR, gaming, and cinematic applications, while highlighting data- and affordance-related challenges for future work.
Abstract
We address the problem of generating realistic 3D human-object interactions (HOIs) driven by textual prompts. To this end, we adopt a modular design and decompose the complex task into simpler sub-tasks. We first develop a dual-branch diffusion model (HOI-DM) that generates both human and object motions conditioned on the input text, and we encourage coherent motions through a cross-attention communication module between the human and object motion generation branches. We also develop an affordance prediction diffusion model (APDM) that predicts the contact area between the human and the object during the interaction described by the textual prompt. The APDM is independent of the HOI-DM's output and can therefore correct potential errors made by the latter. Moreover, it generates the contact points stochastically, which diversifies the generated motions. Finally, we incorporate the estimated contact points into classifier guidance to achieve accurate and close contact between humans and objects. To train and evaluate our approach, we annotate the BEHAVE dataset with text descriptions. Experimental results on BEHAVE and OMOMO demonstrate that our approach produces realistic HOIs with various interactions and different types of objects.
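To make the final step more concrete, the following is a minimal sketch of how a contact objective could nudge a predicted human joint toward an estimated object contact point. It is an illustration under stated assumptions, not the authors' implementation: the squared-distance loss, the step size `scale`, and the function names `contact_loss`, `contact_grad`, and `guide_joint` are all hypothetical. In a diffusion sampler, a gradient of this kind would typically be applied to the denoised estimate at each sampling step; here a single 3D point stands in for that estimate.

```python
from typing import List, Sequence

Vec3 = List[float]


def contact_loss(joint: Sequence[float], target: Sequence[float]) -> float:
    """Squared Euclidean distance between a joint and a contact point (assumed objective)."""
    return sum((j - t) ** 2 for j, t in zip(joint, target))


def contact_grad(joint: Sequence[float], target: Sequence[float]) -> Vec3:
    """Analytic gradient of contact_loss with respect to the joint position."""
    return [2.0 * (j - t) for j, t in zip(joint, target)]


def guide_joint(joint: Sequence[float], target: Sequence[float],
                scale: float = 0.1, steps: int = 1) -> Vec3:
    """Move the joint toward the contact point by gradient descent on contact_loss.

    `scale` is a hypothetical guidance strength; the input list is not modified.
    """
    x = list(joint)
    for _ in range(steps):
        g = contact_grad(x, target)
        x = [xi - scale * gi for xi, gi in zip(x, g)]
    return x


if __name__ == "__main__":
    hand = [1.0, 0.0, 0.0]      # hypothetical predicted hand joint
    contact = [0.0, 0.0, 0.0]   # hypothetical contact point on the object
    guided = guide_joint(hand, contact, scale=0.1, steps=5)
    print(contact_loss(hand, contact), contact_loss(guided, contact))
```

With `scale=0.1`, each step shrinks the joint-to-contact offset by a factor of 0.8, so the loss decreases monotonically. A larger `scale` pulls the joint closer in fewer steps, at the risk of overriding the motion produced by the generative model.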
