Zero-Shot Human-Object Interaction Synthesis with Multimodal Priors

Yuke Lou, Yiming Wang, Zhen Wu, Rui Zhao, Wenjia Wang, Mingyi Shi, Taku Komura

TL;DR

This work introduces a zero-shot framework for generating diverse and physically plausible human–object interactions without relying on 3D HOI datasets. It leverages large multimodal models to extract 2D HOI priors, lifts them to 3D by estimating 3D human poses and category-level 6-DoF object poses via semantic correspondences and differentiable rendering, and then refines the results with physics-based tracking in Isaac Gym guided by LLM-derived contact labels. The approach combines 2D-to-3D uplift, differentiable rendering, and RL-based motion tracking to produce open-vocabulary HOIs with realistic contact and collision avoidance, outperforming baselines in both qualitative and quantitative evaluations. The framework supports augmentation of real motions, HOI reconstruction from video, and automatic 3D HOI dataset generation, enabling scalable, diverse HOI synthesis for applications in VR, robotics, and synthetic data creation.
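The physics-based refinement stage can be pictured as a policy being rewarded for staying close to the kinematic reference. The summary above does not give the reward, so the following is only a minimal sketch that assumes a common exponential tracking reward from the motion-imitation literature. The function name `tracking_reward` and the `scale` parameter are hypothetical, nothing here calls Isaac Gym, and the contact-label guidance is left out.

```python
import math
from typing import Sequence


def tracking_reward(sim_pose: Sequence[float],
                    ref_pose: Sequence[float],
                    scale: float = 2.0) -> float:
    """Reward in (0, 1] that decays with squared distance to the reference.

    ASSUMPTION: exponential imitation reward; this is not the paper's
    stated reward. `scale` is a hypothetical sensitivity parameter.
    """
    if len(sim_pose) != len(ref_pose):
        raise ValueError("simulated and reference poses must have equal length")
    # Sum of squared per-coordinate differences between the two poses.
    squared_error = sum((s - r) ** 2 for s, r in zip(sim_pose, ref_pose))
    # Perfect tracking gives exp(0) = 1; larger errors decay toward 0.
    return math.exp(-scale * squared_error)


if __name__ == "__main__":
    reference = [0.0, 0.5, 1.0]
    print(tracking_reward(reference, reference))        # exact match
    print(tracking_reward([0.1, 0.5, 0.9], reference))  # small deviation
```

A bounded reward of this shape keeps every frame's contribution between 0 and 1, which is one reason it is a popular choice when a simulated character is trained to mimic reference motion.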

Abstract

Human-object interaction (HOI) synthesis is important for applications ranging from virtual reality to robotics. However, acquiring 3D HOI data is complex and costly, which limits existing methods to the narrow range of object types and interaction patterns found in their training datasets. This paper proposes a zero-shot HOI synthesis framework that does not rely on end-to-end training on the currently limited 3D HOI datasets. The core idea is to leverage the extensive HOI knowledge in pre-trained multimodal models. Given a text description, our system first obtains temporally consistent 2D HOI image sequences using image or video generation models, which are then uplifted to 3D HOI milestones of human and object poses. We employ pre-trained human pose estimation models to extract human poses and introduce a generalizable category-level 6-DoF estimation method to obtain the object poses from 2D HOI images. Our estimation method adapts to various object templates obtained from text-to-3D models or online retrieval. Physics-based tracking of the 3D HOI kinematic milestones is then applied to refine both body motions and object poses, yielding more physically plausible results. Experiments demonstrate that our method can generate open-vocabulary HOIs with physical realism and semantic diversity.
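To make the term "6-DoF object pose" concrete: such a pose has three rotational and three translational degrees of freedom, and applying it places an object template in the scene. The sketch below illustrates only that representation, using an axis-angle rotation vector converted with Rodrigues' formula. It is not the paper's estimation method; the function names and the toy template points are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def axis_angle_to_matrix(rotvec: Sequence[float]) -> List[List[float]]:
    """Convert a rotation vector (axis * angle, radians) to a 3x3 matrix."""
    theta = math.sqrt(sum(c * c for c in rotvec))
    if theta < 1e-12:
        # Near-zero rotation: return the identity.
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = (c / theta for c in rotvec)  # unit rotation axis
    c, s = math.cos(theta), math.sin(theta)
    v = 1.0 - c
    # Rodrigues' rotation formula written out entry by entry.
    return [
        [c + kx * kx * v, kx * ky * v - kz * s, kx * kz * v + ky * s],
        [ky * kx * v + kz * s, c + ky * ky * v, ky * kz * v - kx * s],
        [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
    ]


def apply_pose(points: Sequence[Vec3],
               rotvec: Sequence[float],
               translation: Sequence[float]) -> List[Vec3]:
    """Rotate, then translate, each template point: p' = R p + t."""
    rot = axis_angle_to_matrix(rotvec)
    posed = []
    for p in points:
        rotated = [sum(rot[i][j] * p[j] for j in range(3)) for i in range(3)]
        posed.append(tuple(rotated[i] + translation[i] for i in range(3)))
    return posed


if __name__ == "__main__":
    # Hypothetical two-point "template": rotate 90 degrees about z, then shift.
    template = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    pose_rotation = (0.0, 0.0, math.pi / 2)
    pose_translation = (1.0, 2.0, 3.0)
    for point in apply_pose(template, pose_rotation, pose_translation):
        print([round(coord, 6) for coord in point])
```

The six numbers in `pose_rotation` and `pose_translation` are what a pose estimator would have to recover for each frame. An axis-angle vector is used here because it has exactly three free parameters, matching the three rotational degrees of freedom.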

Paper Structure

This paper contains 44 sections, 11 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Our system is composed of two core components: (a) a zero-shot HOI generation pipeline that leverages the generative capabilities of pre-trained multimodal models to obtain rough 3D interaction between humans and objects from text input; (b) a physics-based tracking strategy applied to the HOI generated in part (a) to produce physically plausible animations.
  • Figure 2: Left: Rectified Pose. Right: Initial Pose. Human motions generated by text-to-motion models may lack spatial awareness of objects, which limits the effectiveness of subsequent human-object interaction optimization. For instance, given the prompt "A man is playing the guitar", the generated body motion leaves insufficient space for a plausible guitar placement. Additional examples illustrating the benefits of motion rectification are provided in Fig. \ref{fig:rec}.
  • Figure 3: We train a control policy in Isaac Gym to mimic the reference motion.
  • Figure 4: Zero-shot human-object interaction results generated by our system, using the generative 2D image pipeline.
  • Figure 5: Zero-shot human-object interaction results generated by our system, using video generation models.
  • ...and 8 more figures