Multimodal Priors-Augmented Text-Driven 3D Human-Object Interaction Generation

Yin Wang; Ziyao Zhang; Zhiying Leng; Haitian Liu; Frederick W. B. Li; Mu Li; Xiaohui Liang

TL;DR

This work tackles text-driven 3D human-object interaction (HOI) generation with MP-HOI, a diffusion-based framework that uses multimodal priors (textual, visual, and spatial) to guide both human and object motion. It enriches the object representation with geometric keypoints, contact cues, and dynamic properties, and uses a modality-aware Mixture-of-Experts to fuse multimodal features. A cascaded diffusion strategy first generates human and object motion, then refines their interaction under dedicated supervision, yielding fine-grained HOI motions that align with the prompt. Experiments on FullBodyManipulation and HIMO report state-of-the-art motion quality, interaction realism, and prompt fidelity, along with strong generalization to unseen objects; ablations confirm each component's contribution.

Abstract

We address the challenging task of text-driven 3D human-object interaction (HOI) motion generation. Existing methods primarily rely on a direct text-to-HOI mapping, which suffers from three key limitations due to the significant cross-modality gap: (Q1) sub-optimal human motion, (Q2) unnatural object motion, and (Q3) weak interaction between humans and objects. To address these challenges, we propose MP-HOI, a novel framework grounded in four core insights: (1) Multimodal Data Priors: We leverage multimodal data (text, image, pose/object) from large multimodal models as priors to guide HOI generation, which tackles Q1 and Q2 in data modeling. (2) Enhanced Object Representation: We improve existing object representations by incorporating geometric keypoints, contact features, and dynamic properties, making them more expressive, which tackles Q2 in data representation. (3) Modality-Aware Mixture-of-Experts (MoE) Model: We propose a modality-aware MoE model for effective multimodal feature fusion, which tackles Q1 and Q2 in feature fusion. (4) Cascaded Diffusion with Interaction Supervision: We design a cascaded diffusion framework that progressively refines human-object interaction features under dedicated supervision, which tackles Q3 in interaction refinement. Comprehensive experiments demonstrate that MP-HOI outperforms existing approaches in generating high-fidelity and fine-grained HOI motions.
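
To make item (3) of the abstract more concrete, here is a minimal, illustrative sketch of mixture-of-experts fusion in which the gating weights depend on the modality of the incoming feature. It is not the authors' architecture: the class name `ToyModalityAwareMoE`, the per-modality gating matrices, the linear experts, and the averaging across modalities are all assumptions chosen only to show the general routing idea.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Apply a weight matrix (a list of rows) to the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class ToyModalityAwareMoE:
    """Hypothetical toy model: shared linear experts, one gate per modality."""

    def __init__(self, dim, num_experts, modalities, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        # Experts are shared across modalities.
        self.experts = [rand_matrix(dim, dim) for _ in range(num_experts)]
        # Assumption: "modality-aware" is modeled as a separate gate per modality.
        self.gates = {m: rand_matrix(num_experts, dim) for m in modalities}

    def route(self, modality, x):
        """Return expert weights (summing to 1) for a feature of this modality."""
        return softmax(linear(self.gates[modality], x))

    def fuse(self, features):
        """Fuse a dict {modality: vector} into a single vector.

        Each modality's feature is passed through every expert, mixed by its
        gate weights, and the per-modality mixtures are then averaged.
        """
        dim = len(next(iter(features.values())))
        fused = [0.0] * dim
        for modality, x in features.items():
            weights = self.route(modality, x)
            for w, expert in zip(weights, self.experts):
                y = linear(expert, x)
                fused = [f + w * yi for f, yi in zip(fused, y)]
        return [f / len(features) for f in fused]


moe = ToyModalityAwareMoE(dim=4, num_experts=3,
                          modalities=["text", "image", "pose"])
fused = moe.fuse({
    "text": [1.0, 0.0, 0.0, 0.0],
    "image": [0.0, 1.0, 0.0, 0.0],
    "pose": [0.0, 0.0, 1.0, 0.0],
})
```

In this sketch, only the gates are modality-specific, so the experts can specialize while the router decides how much each modality relies on each expert. The actual model may condition routing quite differently.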

Paper Structure (31 sections, 9 equations, 8 figures, 3 tables)

Introduction
Related Work
Text-Driven Human Motion Generation
Text-Driven Human-Object Interaction Generation
Large Model-Assisted Motion Generation
Preliminaries
Methodology
Overview
Data Representation
Multimodal Priors
Human/Object Motion Diffusion Process
Human/Object Motion Diffusion Model
Modality-aware Mixture-of-Experts Models
Human/Object Motion Diffusion Training Objective
Human-Object Interaction Diffusion Process
...and 16 more sections

Figures (8)

Figure 1: MP-HOI excels in generating fine-grained human-object interaction motions from multimodal data priors, achieving both high-quality human-object interactions and precise text-motion alignment.
Figure 2: Overview of MP-HOI. Given a text prompt and multimodal priors, the reverse denoising process of the Human Motion Diffusion Model and Object Motion Diffusion Model starts from noisy motion data $H_T$ and $O_T$, generating clean human and object motion data ($H_0$ and $O_0$). Then, the Human-Object Interaction Diffusion Model takes the text prompt and the clean human and object motion data ($H_0$ and $O_0$) as inputs, and generates the final clean human-object interaction motion data $X_0$.
Figure 3: The pipeline for large models processing multimodal data (text and image).
Figure 4: Illustration of the overall object motion diffusion pipeline. (a) Object diffusion process. (b) Object motion diffusion model. (c) Architecture of Modality-aware MoE Models. Notably, the Human Motion Diffusion Model adopts this identical architecture, with the object geometry feature $C_p^f$ replaced by the atomic motion feature $C_a^f$.
Figure 5: Illustration of the overall human-object interaction motion diffusion pipeline. (a) Human-object interaction motion diffusion process. (b) Architecture of human-object interaction diffusion model. Pre-Human Motion and Pre-Object Motion represent the human motion and object motion generated in the human/object motion diffusion process, respectively.
...and 3 more figures
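
The cascaded sampling order described in the Figure 2 caption can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's implementation: `toy_step` is a placeholder denoiser that simply moves values toward a given target, the `"target"` conditioning key is hypothetical, and starting the interaction stage from its own noise sample `X_T` is an assumption the caption does not confirm. Only the ordering (denoise $H_T$ and $O_T$ to $H_0$ and $O_0$, then condition the interaction stage on the text, $H_0$, and $O_0$ to obtain $X_0$) comes from the caption.

```python
import random


def reverse_denoise(denoise_step, x_T, num_steps, cond):
    """Generic reverse process: apply denoise_step for t = num_steps, ..., 1."""
    x = x_T
    for t in range(num_steps, 0, -1):
        x = denoise_step(x, t, cond)
    return x


def toy_step(x, t, cond):
    """Placeholder denoiser (assumption): step 1/t of the way toward cond['target'].

    A real model would predict the clean motion or the noise with a network.
    """
    target = cond["target"]
    return [xi + (ti - xi) / t for xi, ti in zip(x, target)]


def generate_hoi(text, human_cond, object_cond, dim=4, num_steps=10, seed=0):
    """Toy cascade: human and object stages first, then the interaction stage."""
    rng = random.Random(seed)

    # Stage 1: human and object motion are each denoised from Gaussian noise.
    H_T = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    O_T = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    H_0 = reverse_denoise(toy_step, H_T, num_steps, human_cond)
    O_0 = reverse_denoise(toy_step, O_T, num_steps, object_cond)

    # Stage 2: the interaction stage is conditioned on the text plus H_0 and O_0.
    # In this toy, that conditioning is just the concatenation used as a target.
    X_T = [rng.gauss(0.0, 1.0) for _ in range(2 * dim)]
    hoi_cond = {"text": text, "target": H_0 + O_0}
    X_0 = reverse_denoise(toy_step, X_T, num_steps, hoi_cond)
    return H_0, O_0, X_0


H_0, O_0, X_0 = generate_hoi(
    "a person lifts a box",
    human_cond={"target": [1.0] * 4},
    object_cond={"target": [2.0] * 4},
)
```

The point of the cascade is that the interaction stage never has to bridge the full text-to-HOI gap on its own: it starts from human and object motions that are already plausible in isolation.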
