InterDreamer: Zero-Shot Text to 3D Dynamic Human-Object Interaction

Sirui Xu, Ziyin Wang, Yu-Xiong Wang, Liang-Yan Gui

TL;DR

InterDreamer tackles zero-shot text-guided generation of 3D human-object interactions (HOIs) by decoupling interaction semantics from low-level dynamics. An LLM performs high-level planning, which guides a pre-trained text-to-motion model that produces the human motion and an interaction retrieval model that supplies the object's initial pose. A vertex-based world model, learned from motion data, then predicts how the human's actions move the object, and an optimization stage coupled with it enforces physical plausibility and coherence. The framework generates realistic HOI sequences on BEHAVE and CHAIRS without training on paired text–interaction data, and its dynamics model, trained on BEHAVE, generalizes to the unseen CHAIRS dataset.

Abstract

Text-conditioned human motion generation has experienced significant advancements with diffusion models trained on extensive motion capture data and corresponding textual annotations. However, extending such success to 3D dynamic human-object interaction (HOI) generation faces notable challenges, primarily due to the lack of large-scale interaction data and comprehensive descriptions that align with these interactions. This paper takes the initiative and showcases the potential of generating human-object interactions without direct training on text-interaction pair data. Our key insight in achieving this is that interaction semantics and dynamics can be decoupled. Being unable to learn interaction semantics through supervised training, we instead leverage pre-trained large models, synergizing knowledge from a large language model and a text-to-motion model. While such knowledge offers high-level control over interaction semantics, it cannot grasp the intricacies of low-level interaction dynamics. To overcome this issue, we further introduce a world model designed to comprehend simple physics, modeling how human actions influence object motion. By integrating these components, our novel framework, InterDreamer, is able to generate text-aligned 3D HOI sequences in a zero-shot manner. We apply InterDreamer to the BEHAVE and CHAIRS datasets, and our comprehensive experimental analysis demonstrates its capability to generate realistic and coherent interaction sequences that seamlessly align with the text directives.

Paper Structure (20 sections, 13 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: InterDreamer can generate vivid 3D human-object interaction sequences guided by textual descriptions. Its zero-shot ability is achieved by integrating semantics and dynamics knowledge from large-scale text-motion data (upper left), a large language model (LLM) (bottom left), a 3D human-object interaction database (upper middle), and an interaction prior (bottom middle). We visualize the generated text-guided interaction sequence (upper right), with the beginning of the sequence unfolded (bottom right). More details are available at https://sirui-xu.github.io/InterDreamer/.
  • Figure 2: An overview of our InterDreamer. (i) Our high-level planning analyzes the description using LLMs and provides guidance to the low-level control. (ii) Our low-level control includes a text-to-motion model that translates text into human actions $\color{red}\boldsymbol{a}_{t+1}$, and an interaction retrieval model for extracting the object's initial pose as the first state $\color{blue}\boldsymbol{s}_1$. (iii) Our world model executes the actions and outputs the next state $\color{blue}\boldsymbol{s}_{t+1}$ through dynamics modeling. An optimization process is coupled with the dynamics model, projecting the state and action onto valid counterparts $\color{blue}\boldsymbol{s}_{t+1}^\ast$ and $\color{red}\boldsymbol{a}_{t+1}^\ast$. Solid arrows mean that the process is performed iteratively.
  • Figure 3: Qualitative results on the BEHAVE dataset [bhatnagar22behave]. The interaction sequences are presented through a time-series visualization where color changes denote progression through frames. Frames are separately visualized when the pelvis remains nearly static. Here, our synergized knowledge comes from GPT-4 [chatgpt] and MotionGPT [jiang2023motiongpt].
  • Figure 4: Qualitative results in more challenging scenarios with free-form input not from our annotations, showing the ability of our InterDreamer to fit different object sizes and handle complex and long sequences. Here, our synergized knowledge comes from GPT-4 [chatgpt] and MotionGPT [jiang2023motiongpt].
  • Figure 5: Qualitative results on the CHAIRS dataset [jiang2022chairs]. Our dynamics model trained on the BEHAVE dataset [bhatnagar22behave] generalizes well on the CHAIRS dataset unseen in training. Interaction sequences are visualized through a time-series style where color changes denote progression through frames. Frames are separately visualized. Here, high-level planning and low-level control use GPT-4 [chatgpt] and MotionGPT [jiang2023motiongpt], respectively.
  • ...and 5 more figures
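
The iterative loop described in the Figure 2 caption (plan, generate actions, retrieve the first state, then alternate a dynamics step with a projection onto valid states and actions) can be sketched in a few lines of Python. This is only a toy illustration of the control flow, not the paper's implementation: the function names `plan`, `text_to_motion`, `retrieve_initial_state`, `dynamics`, and `project` are hypothetical, the object is reduced to a single 3D point rather than a set of vertices, the learned models are replaced by hand-written rules, and the contact radius, fall speed, and floor constraint are assumptions made up for the example.

```python
import math
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


def plan(text: str) -> Dict[str, str]:
    """Hypothetical stand-in for LLM high-level planning."""
    return {"object": "box", "body_part": "hand", "motion_text": text}


def text_to_motion(guidance: Dict[str, str], num_frames: int) -> List[Vec3]:
    """Stand-in for a text-to-motion model: the hand slides along x at 1 m height."""
    return [(0.1 * t, 0.0, 1.0) for t in range(1, num_frames + 1)]


def retrieve_initial_state(guidance: Dict[str, str]) -> Vec3:
    """Stand-in for interaction retrieval: the object's initial position s_1."""
    return (0.0, 0.0, 1.0)


def dynamics(state: Vec3, action: Vec3, contact_radius: float = 0.2) -> Vec3:
    """Toy world model: the object follows the hand if in contact, else falls."""
    if math.dist(state, action) < contact_radius:
        return action
    x, y, z = state
    return (x, y, z - 0.05)


def project(state: Vec3, action: Vec3) -> Tuple[Vec3, Vec3]:
    """Toy projection: clamp both object and hand to lie on or above the floor."""
    s = (state[0], state[1], max(state[2], 0.0))
    a = (action[0], action[1], max(action[2], 0.0))
    return s, a


def rollout(text: str, num_frames: int = 5) -> Tuple[List[Vec3], List[Vec3]]:
    """Alternate dynamics and projection, feeding each refined state back in."""
    guidance = plan(text)
    actions = text_to_motion(guidance, num_frames)
    state = retrieve_initial_state(guidance)
    states, refined_actions = [state], []
    for action in actions:
        next_state = dynamics(state, action)             # s_{t+1}
        next_state, action = project(next_state, action)  # s*_{t+1}, a*_{t+1}
        states.append(next_state)
        refined_actions.append(action)
        state = next_state
    return states, refined_actions


if __name__ == "__main__":
    states, actions = rollout("a person carries a box forward")
    for t, s in enumerate(states, start=1):
        print(t, s)
```

The point of the sketch is the separation of concerns: the text only influences `plan` and `text_to_motion`, while `dynamics` and `project` never see the text, which mirrors the decoupling of interaction semantics from interaction dynamics described in the abstract.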