Multimodal Priors-Augmented Text-Driven 3D Human-Object Interaction Generation
Yin Wang, Ziyao Zhang, Zhiying Leng, Haitian Liu, Frederick W. B. Li, Mu Li, Xiaohui Liang
TL;DR
This work tackles text-driven 3D human–object interaction (HOI) generation by introducing MP-HOI, a diffusion-based framework that leverages multimodal priors (textual, visual, and spatial) to guide both human and object motions. It enhances object representation with geometric keypoints, contact cues, and dynamic properties, and employs a modality-aware Mixture-of-Experts to fuse multimodal features. A cascaded diffusion strategy progressively refines human motion, object motion, and then their interaction under dedicated supervision, yielding high-fidelity, fine-grained HOI motions that align with prompts. Extensive experiments on FullBodyManipulation and HIMO demonstrate state-of-the-art performance in motion quality, interaction realism, and prompt fidelity, with strong generalization to unseen objects; ablations confirm each component’s contribution.
Abstract
We address the challenging task of text-driven 3D human-object interaction (HOI) motion generation. Existing methods primarily rely on a direct text-to-HOI mapping, which suffers from three key limitations due to the significant cross-modality gap: (Q1) sub-optimal human motion, (Q2) unnatural object motion, and (Q3) weak interaction between humans and objects. To address these challenges, we propose MP-HOI, a novel framework grounded in four core insights: (1) Multimodal Data Priors: We leverage multimodal data (text, image, pose/object) from large multimodal models as priors to guide HOI generation, which tackles Q1 and Q2 in data modeling. (2) Enhanced Object Representation: We improve existing object representations by incorporating geometric keypoints, contact features, and dynamic properties, yielding more expressive object representations, which tackles Q2 in data representation. (3) Modality-Aware Mixture-of-Experts (MoE) Model: We propose a modality-aware MoE model for effective multimodal feature fusion, which tackles Q1 and Q2 in feature fusion. (4) Cascaded Diffusion with Interaction Supervision: We design a cascaded diffusion framework that progressively refines human-object interaction features under dedicated supervision, which tackles Q3 in interaction refinement. Comprehensive experiments demonstrate that MP-HOI outperforms existing approaches in generating high-fidelity and fine-grained HOI motions.
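To make the idea of modality-aware MoE fusion more concrete, the following is a minimal, generic sketch in pure Python. It is not the authors' implementation: the class name `ModalityAwareMoE`, the choice of one router per modality over a shared pool of linear experts, the top-k softmax gating, and the mean-pooling in `fuse` are all assumptions made for illustration, and the weights are random rather than learned.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class ModalityAwareMoE:
    """Toy MoE layer: shared linear experts, one router per modality.

    Hypothetical design for illustration only; the paper's actual
    architecture is not specified in the abstract.
    """

    def __init__(self, dim, num_experts, modalities, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.experts = [rand_matrix(dim, dim, rng) for _ in range(num_experts)]
        # "Modality-aware": each modality gets its own gating matrix.
        self.routers = {m: rand_matrix(num_experts, dim, rng) for m in modalities}

    def route(self, x, modality):
        """Return (expert_index, weight) pairs for the top-k experts."""
        logits = matvec(self.routers[modality], x)
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        top = order[: self.top_k]
        weights = softmax([logits[i] for i in top])
        return list(zip(top, weights))

    def forward(self, x, modality):
        """Weighted sum of the selected experts' outputs for one token."""
        out = [0.0] * len(x)
        for idx, w in self.route(x, modality):
            y = matvec(self.experts[idx], x)
            out = [o + w * v for o, v in zip(out, y)]
        return out


def fuse(tokens, moe):
    """Mean-pool MoE outputs over (modality, feature) tokens."""
    outs = [moe.forward(x, m) for m, x in tokens]
    return [sum(col) / len(outs) for col in zip(*outs)]


moe = ModalityAwareMoE(dim=4, num_experts=3, modalities=["text", "image", "pose"])
tokens = [
    ("text", [1.0, 0.0, 0.5, -0.5]),
    ("image", [0.2, 0.8, -0.1, 0.3]),
    ("pose", [-0.4, 0.1, 0.9, 0.0]),
]
fused = fuse(tokens, moe)
```

Here each token carries a modality tag, so the text, image, and pose features are routed by different gating matrices while still sharing the same experts; in a trained model the routers and experts would be learned jointly with the diffusion backbone.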
