OOD-HOI: Text-Driven 3D Whole-Body Human-Object Interactions Generation Beyond Training Domains

Yixuan Zhang, Hui Yang, Chuanchen Luo, Junran Peng, Yuxi Wang, Zhaoxiang Zhang

TL;DR

OOD-HOI addresses the challenge of generating realistic whole-body human-object interactions from text in out-of-domain scenarios. It introduces a three-component pipeline: a dual-branch reciprocal diffusion model that jointly generates human and object poses, a contact-guided interaction refiner that enforces physical plausibility through a guidance function minimizing floating and interpenetration, and a dynamic adaptation module that improves generalization via semantic adjustment and geometry deformation. Poses use the explicit representation $x_0=\{x^h_0, x^o_0\}$ with $x^h_0\in\mathbb{R}^{159}$ and $x^o_0\in\mathbb{R}^{6}$. The method reports improved accuracy, lower FID, and better handling of Text-OOD and Object-OOD prompts on the GRAB and HO-3D datasets, outperforming five baselines. The contributions (reciprocal diffusion for coherent body-object coupling, inference-time contact-guided refinement, and robust OOD strategies) enable more realistic and diverse 3D HOI generation, with practical implications for VR/AR, robotics, and animation.
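As a rough illustration of the pose representation $x_0=\{x^h_0, x^o_0\}$ mentioned above, the sketch below stores the two parts with the stated dimensions (159 and 6) and converts to and from a single flat vector. The class name `InteractionPose`, the concatenation order, and the flat layout are assumptions made for illustration; the summary does not say how the parameters are ordered or what each entry encodes.

```python
from dataclasses import dataclass
from typing import List

HUMAN_DIM = 159  # dimension of x^h_0 as stated in the summary
OBJECT_DIM = 6   # dimension of x^o_0 as stated in the summary


@dataclass
class InteractionPose:
    """Hypothetical container for x_0 = {x^h_0, x^o_0}."""
    human: List[float]  # x^h_0, whole-body pose parameters
    obj: List[float]    # x^o_0, object pose parameters

    def __post_init__(self) -> None:
        # Reject vectors that do not match the stated dimensions.
        if len(self.human) != HUMAN_DIM:
            raise ValueError(f"human pose must have {HUMAN_DIM} entries")
        if len(self.obj) != OBJECT_DIM:
            raise ValueError(f"object pose must have {OBJECT_DIM} entries")

    def flatten(self) -> List[float]:
        # Assumed layout: human parameters first, then object parameters.
        return list(self.human) + list(self.obj)

    @classmethod
    def from_flat(cls, x: List[float]) -> "InteractionPose":
        if len(x) != HUMAN_DIM + OBJECT_DIM:
            raise ValueError("flat pose has the wrong length")
        return cls(human=list(x[:HUMAN_DIM]), obj=list(x[HUMAN_DIM:]))


pose = InteractionPose(human=[0.0] * HUMAN_DIM, obj=[0.0] * OBJECT_DIM)
flat = pose.flatten()
restored = InteractionPose.from_flat(flat)
```

Keeping the two parts separate, rather than working only with the flat vector, mirrors the dual-branch design in which human and object poses are handled by different branches.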

Abstract

Generating realistic 3D human-object interactions (HOIs) from text descriptions is an active research topic with potential applications in virtual and augmented reality, robotics, and animation. However, creating high-quality 3D HOIs remains challenging due to the lack of large-scale interaction data and the difficulty of ensuring physical plausibility, especially in out-of-domain (OOD) scenarios. Current methods tend to focus on either the body or the hands, which limits their ability to produce cohesive and realistic interactions. In this paper, we propose OOD-HOI, a text-driven framework for generating whole-body human-object interactions that generalize well to new objects and actions. Our approach integrates a dual-branch reciprocal diffusion model to synthesize initial interaction poses, a contact-guided interaction refiner to improve physical accuracy based on predicted contact areas, and a dynamic adaptation mechanism, comprising semantic adjustment and geometry deformation, to improve robustness. Experimental results demonstrate that OOD-HOI generates more realistic and physically plausible 3D interaction poses in OOD scenarios than existing methods.
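To make the idea of contact-guided refinement more concrete, here is a toy sketch of a guidance cost that penalizes both floating and interpenetration, followed by gradient steps on the object position. It is not the paper's formulation: the object is assumed to be a sphere, only its center is optimized, and the function names, weight `w_pen`, step count, and learning rate are all hypothetical choices for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def guidance_cost(points: Sequence[Vec], center: Sequence[float],
                  radius: float, w_pen: float = 1.0) -> float:
    """Toy cost: squared gap (floating) plus weighted squared depth (penetration)."""
    cost = 0.0
    for p in points:
        d = math.dist(p, center) - radius  # signed distance to the sphere surface
        cost += max(d, 0.0) ** 2 + w_pen * max(-d, 0.0) ** 2
    return cost


def guidance_grad(points: Sequence[Vec], center: Sequence[float],
                  radius: float, w_pen: float = 1.0) -> List[float]:
    """Analytic gradient of guidance_cost with respect to the sphere center."""
    grad = [0.0, 0.0, 0.0]
    for p in points:
        dist = math.dist(p, center)
        if dist == 0.0:
            continue  # direction undefined at the exact center; skip this point
        d = dist - radius
        dcost_dd = 2.0 * d if d > 0 else 2.0 * w_pen * d
        for k in range(3):
            dd_dc = -(p[k] - center[k]) / dist  # derivative of distance w.r.t. center
            grad[k] += dcost_dd * dd_dc
    return grad


def refine(points: Sequence[Vec], center: Sequence[float], radius: float,
           steps: int = 200, lr: float = 0.05) -> Vec:
    """Plain gradient descent on the object center (an assumed, simplified refiner)."""
    c = list(center)
    for _ in range(steps):
        g = guidance_grad(points, c, radius)
        c = [ck - lr * gk for ck, gk in zip(c, g)]
    return (c[0], c[1], c[2])


# Two hypothetical fingertip contact points and a sphere floating above them.
contacts = [(0.1, 0.0, 0.0), (-0.1, 0.0, 0.0)]
start = (0.0, 0.0, 0.3)
before = guidance_cost(contacts, start, radius=0.1)
refined = refine(contacts, start, radius=0.1)
after = guidance_cost(contacts, refined, radius=0.1)
```

In this toy setup the cost is zero exactly when every contact point lies on the object surface, so descending it pulls a floating object toward the hand and pushes a penetrating object away from it.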

Paper Structure

This paper contains 17 sections, 5 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: We introduce OOD-HOI, a novel text-driven method for generating 3D human-object interactions in out-of-domain scenarios. OOD-HOI can generate realistic whole-body human-object interaction poses directly from textual descriptions. Even when encountering unseen entities in instructions (highlighted in red), such as the action "elevate" or the object "bottle", it produces physically plausible results.
  • Figure 2: Overview of OOD-HOI. Our approach decomposes the generation process into three modules: (1) a dual-branch reciprocal diffusion model that exchanges information between human and object to generate an initial interaction pose, (2) a contact-guided interaction refiner that revises the initial human-object pose with additional inference-time guidance, and (3) a dynamic adaptation module designed for out-of-domain (OOD) generation, ensuring more realistic and physically plausible results.
  • Figure 3: Contact-guided interaction refiner for physical optimization. The refiner takes the text prompt, the initial hand pose, and the object geometry as input, predicts the contact area between hand and object, and reduces object floating and interpenetration based on the predicted contact areas.
  • Figure 4: For geometry deformation, we propose a condition enhancement that deforms the object under the constraint of a constant contact area. Since the primary contact for the airplane model typically occurs on its body, we apply controlled random deformations, such as rotating the wings or stretching the nose within specified limits, which improves model robustness.
  • Figure 5: We compare our generated human-object interaction poses with baseline results on the GRAB dataset [taheri2020grab]. Each row shows the results of Text2HOI [cha2024text2hoi], IMoS [ghosh2022imos], and ours.
  • ...and 1 more figures
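The geometry deformation described for Figure 4 can be sketched in a simplified form: vertices flagged as belonging to the contact region are left untouched, while the remaining vertices are stretched along one axis by a bounded random factor. This is only an assumed stand-in for the paper's procedure; the function name `deform_outside_contact`, the per-vertex boolean mask, and the single-axis stretch bounded by `max_stretch` are illustrative choices, and a practical version would also need to blend the deformation smoothly at the boundary of the contact region.

```python
import random
from typing import List, Optional, Sequence, Tuple

Vec = Tuple[float, float, float]


def deform_outside_contact(vertices: Sequence[Vec],
                           contact_mask: Sequence[bool],
                           max_stretch: float = 0.2,
                           axis: int = 0,
                           seed: Optional[int] = None) -> List[Vec]:
    """Stretch non-contact vertices along one axis; keep contact vertices fixed."""
    if len(vertices) != len(contact_mask):
        raise ValueError("vertices and contact_mask must have the same length")
    rng = random.Random(seed)
    # One random scale per call, limited to [1 - max_stretch, 1 + max_stretch].
    scale = 1.0 + rng.uniform(-max_stretch, max_stretch)
    deformed: List[Vec] = []
    for v, in_contact in zip(vertices, contact_mask):
        w = list(v)
        if not in_contact:
            w[axis] *= scale  # only vertices outside the contact region move
        deformed.append((w[0], w[1], w[2]))
    return deformed


# Hypothetical four-vertex object: the first two vertices form the contact region.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 1.0, 0.0), (3.0, 1.0, 1.0)]
mask = [True, True, False, False]
deformed = deform_outside_contact(vertices, mask, max_stretch=0.2, seed=0)
```

Because the contact vertices are returned unchanged, any contact area computed from them is the same before and after deformation, which is the constraint the caption describes.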