ViHOI: Human-Object Interaction Synthesis with Visual Priors

Songjin Cai, Linjie Zhong, Ling Guo, Changxing Ding

Abstract

Generating realistic and physically plausible 3D Human-Object Interactions (HOI) remains a key challenge in motion generation. One primary reason is that such physical constraints are difficult to describe with words alone. To address this limitation, we propose a new paradigm: extracting rich interaction priors from easily accessible 2D images. Specifically, we introduce ViHOI, a novel framework that enables diffusion-based generative models to leverage rich, task-specific priors from 2D images to enhance generation quality. We utilize a large Vision-Language Model (VLM) as a powerful prior-extraction engine and adopt a layer-decoupled strategy to obtain visual and textual priors. Concurrently, we design a Q-Former-based adapter that compresses the VLM's high-dimensional features into compact prior tokens, which significantly facilitates the conditional training of our diffusion model. Our framework is trained on images rendered from the dataset's motion sequences to ensure strict semantic alignment between visual inputs and motions. During inference, it leverages reference images synthesized by a text-to-image generation model to improve generalization to unseen objects and interaction categories. Experimental results demonstrate that ViHOI achieves state-of-the-art performance, outperforming existing methods across multiple benchmarks while exhibiting superior generalization.
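
To make the Q-Former-based prior adapter concrete, below is a minimal PyTorch-style sketch under our own assumptions: the module name `PriorAdapter`, the feature dimensions, and the cross-attention/pooling layout are hypothetical illustrations, not the authors' implementation. The idea shown is that learned query tokens cross-attend to high-dimensional VLM features and are pooled into one compact prior token per modality.

```python
# Hypothetical sketch of a Q-Former-style prior adapter (not the authors' code).
# Learned query tokens cross-attend to high-dimensional VLM features, which are
# then pooled into a single compact prior token used to condition the HOI generator.
import torch
import torch.nn as nn

class PriorAdapter(nn.Module):
    def __init__(self, vlm_dim=4096, token_dim=512, num_queries=8, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, token_dim) * 0.02)  # learned queries
        self.kv_proj = nn.Linear(vlm_dim, token_dim)            # project VLM features down
        self.cross_attn = nn.MultiheadAttention(token_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.LayerNorm(token_dim),
            nn.Linear(token_dim, token_dim * 4),
            nn.GELU(),
            nn.Linear(token_dim * 4, token_dim),
        )
        self.out_norm = nn.LayerNorm(token_dim)

    def forward(self, vlm_feats):                               # vlm_feats: (B, N, vlm_dim)
        B = vlm_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)         # (B, num_queries, token_dim)
        kv = self.kv_proj(vlm_feats)                            # (B, N, token_dim)
        attn_out, _ = self.cross_attn(q, kv, kv)                # queries attend to VLM features
        x = attn_out + self.ffn(attn_out)
        return self.out_norm(x.mean(dim=1, keepdim=True))       # (B, 1, token_dim) compact token

# Usage: one adapter per modality (visual priors from reference images, textual priors from the prompt).
visual_adapter, textual_adapter = PriorAdapter(), PriorAdapter()
visual_token = visual_adapter(torch.randn(2, 196, 4096))   # dummy VLM image features
textual_token = textual_adapter(torch.randn(2, 77, 4096))  # dummy VLM text features
```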

Paper Structure

This paper contains 35 sections, 4 equations, 11 figures, and 7 tables.

Figures (11)

  • Figure 1: Qualitative results on the FullBodyManipulation dataset [OMOMO]. The three images on the left side are the reference inputs, while the right side shows the motion sequences generated from them. Despite imperfections in these reference images, the generated HOI motions remain plausible and well aligned with the textual semantics.
  • Figure 2: Overall architecture of ViHOI. We extract visual priors from a set of reference images and textual priors from the input prompt using a VLM, which naturally aligns the priors of the two modalities. Two Q-Former-based prior adapters then each distill their high-dimensional priors into a single compact token, providing the diffusion model with semantically consistent conditioning signals. At each denoising step, the selected HOI generator uses these compact visual and textual prior tokens to guide the synthesis of realistic, semantically coherent human-object interactions (a conditioning sketch follows this figure list).
  • Figure 2: User study on the FullBodyManipulation dataset [OMOMO].
  • Figure 3: Illustration of strategies to obtain the set of reference images during the training and inference phases, respectively.
  • Figure 3: User study on the 3D-Future dataset [3D-Future].
  • ...and 6 more figures
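
As a hedged illustration of the conditioning path described in the Figure 2 caption above, the sketch below shows one plausible way the compact visual and textual prior tokens could be injected at every denoising step. The `ConditionedDenoiser` interface, the transformer layout, and all dimensions (e.g., the motion feature size) are our own assumptions, not the paper's actual generator.

```python
# Hypothetical sketch of prior-token conditioning at each denoising step (not the authors' code).
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    """Transformer denoiser whose motion tokens cross-attend to the two prior tokens."""
    def __init__(self, motion_dim=263, token_dim=512, num_layers=4, num_heads=8):
        super().__init__()
        # motion_dim is a placeholder for the concatenated human+object motion representation.
        self.in_proj = nn.Linear(motion_dim, token_dim)
        self.time_embed = nn.Sequential(nn.Linear(1, token_dim), nn.SiLU(), nn.Linear(token_dim, token_dim))
        layer = nn.TransformerDecoderLayer(token_dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out_proj = nn.Linear(token_dim, motion_dim)

    def forward(self, noisy_motion, t, visual_token, textual_token):
        # noisy_motion: (B, T, motion_dim); t: (B,); prior tokens: (B, 1, token_dim) each
        x = self.in_proj(noisy_motion) + self.time_embed(t.view(-1, 1, 1).float())
        cond = torch.cat([visual_token, textual_token], dim=1)   # (B, 2, token_dim) conditioning memory
        x = self.decoder(tgt=x, memory=cond)                     # motion tokens cross-attend to prior tokens
        return self.out_proj(x)                                  # predicted clean motion (or noise)

# Usage with dummy inputs: 2 sequences of 60 frames conditioned on the two prior tokens.
denoiser = ConditionedDenoiser()
x_t = torch.randn(2, 60, 263)
t = torch.randint(0, 1000, (2,))
pred = denoiser(x_t, t, visual_token=torch.randn(2, 1, 512), textual_token=torch.randn(2, 1, 512))
```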