OOD-HOI: Text-Driven 3D Whole-Body Human-Object Interactions Generation Beyond Training Domains
Yixuan Zhang, Hui Yang, Chuanchen Luo, Junran Peng, Yuxi Wang, Zhaoxiang Zhang
TL;DR
OOD-HOI addresses the challenge of generating realistic whole-body human-object interactions from text in out-of-domain scenarios. It introduces a three-component pipeline: a dual-branch reciprocal diffusion model that jointly generates human and object poses, a contact-guided interaction refiner that enforces physical plausibility through a guidance function minimizing floating and interpenetration, and a dynamic adaptation module that improves generalization via semantic adjustment and geometry deformation. Poses are represented explicitly as $x_0=\{x^h_0, x^o_0\}$ with $x^h_0\in\mathbb{R}^{159}$ and $x^o_0\in\mathbb{R}^{6}$. The method reports improved accuracy, lower FID, and better handling of Text-OOD and Object-OOD prompts on the GRAB and HO-3D datasets, outperforming five strong baselines. The contributions (reciprocal diffusion for coherent body-object coupling, inference-time contact-guided refinement, and robust OOD strategies) enable more realistic and diverse 3D HOI generation, with practical implications for VR/AR, robotics, and animation.
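To make the pose representation $x_0=\{x^h_0, x^o_0\}$ and the idea of a contact penalty concrete, here is a minimal sketch, not the authors' implementation. Only the dimensions 159 and 6 come from the text; the function names, the signed-distance convention (positive means floating, negative means penetrating), and the weights `w_float` and `w_pen` are hypothetical choices made for illustration. What each entry of the pose vectors encodes is not assumed here.

```python
import numpy as np

HUMAN_DIM = 159  # dimension of x^h_0, as stated in the text
OBJECT_DIM = 6   # dimension of x^o_0, as stated in the text


def pack_pose(x_h, x_o):
    """Concatenate human and object pose vectors into one joint vector x_0."""
    x_h = np.asarray(x_h, dtype=np.float64)
    x_o = np.asarray(x_o, dtype=np.float64)
    if x_h.shape != (HUMAN_DIM,) or x_o.shape != (OBJECT_DIM,):
        raise ValueError("expected shapes (159,) and (6,)")
    return np.concatenate([x_h, x_o])


def unpack_pose(x_0):
    """Split a joint vector x_0 back into its human and object parts."""
    x_0 = np.asarray(x_0, dtype=np.float64)
    if x_0.shape != (HUMAN_DIM + OBJECT_DIM,):
        raise ValueError("expected shape (165,)")
    return x_0[:HUMAN_DIM], x_0[HUMAN_DIM:]


def contact_penalty(signed_dist, w_float=1.0, w_pen=1.0):
    """Hypothetical penalty on contact-point distances to the object surface.

    Assumed convention: positive distance = gap (floating),
    negative distance = interpenetration. Zero means exact contact.
    """
    d = np.asarray(signed_dist, dtype=np.float64)
    floating = np.maximum(d, 0.0)       # gap between body and object
    penetration = np.maximum(-d, 0.0)   # depth inside the object
    return float(np.sum(w_float * floating**2 + w_pen * penetration**2))
```

A guidance function in the sense of the summary would lower a quantity of this kind at inference time; how OOD-HOI actually defines and applies it is not specified in this section.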
Abstract
Generating realistic 3D human-object interactions (HOIs) from text descriptions is an active research topic with potential applications in virtual and augmented reality, robotics, and animation. However, creating high-quality 3D HOIs remains challenging due to the lack of large-scale interaction data and the difficulty of ensuring physical plausibility, especially in out-of-domain (OOD) scenarios. Current methods tend to focus on either the body or the hands, which limits their ability to produce cohesive and realistic interactions. In this paper, we propose OOD-HOI, a text-driven framework for generating whole-body human-object interactions that generalizes well to new objects and actions. Our approach integrates a dual-branch reciprocal diffusion model to synthesize initial interaction poses, a contact-guided interaction refiner to improve physical accuracy based on predicted contact areas, and a dynamic adaptation mechanism that combines semantic adjustment and geometry deformation to improve robustness. Experimental results demonstrate that OOD-HOI generates more realistic and physically plausible 3D interaction poses in OOD scenarios than existing methods.
