CG-HOI: Contact-Guided 3D Human-Object Interaction Generation
Christian Diller, Angela Dai
TL;DR
CG-HOI generates dynamic 3D human–object interactions (HOIs) from text by jointly modeling full-body human motion, object motion, and body–object contact within a single diffusion framework. The three modalities are denoised together and linked through cross-attention so that the model learns their interdependencies, and the method uses contact-based weighting together with inference-time diffusion guidance to encourage physical plausibility, producing realistic HOIs from a text description and object geometry. It is evaluated on the BEHAVE and CHAIRS datasets and shown in two applications: generating human motion conditioned on a given object trajectory, and populating static 3D scenes. Ablations isolate the contributions of contact modeling, cross-attention, and inference-time guidance. Because it produces semantically coherent, physically plausible HOIs without re-training for new object trajectories or static environments, the approach is relevant to interactive 3D animation, robotics planning, and scene understanding.
Abstract
We propose CG-HOI, the first method to address the task of generating dynamic 3D human-object interactions (HOIs) from text. We model the motion of both human and object in an interdependent fashion, as semantically rich human motion rarely happens in isolation without any interactions. Our key insight is that explicitly modeling contact between the human body surface and object geometry can be used as strong proxy guidance, both during training and inference. Using this guidance to bridge human and object motion enables generating more realistic and physically plausible interaction sequences, where the human body and corresponding object move in a coherent manner. Our method first learns to model human motion, object motion, and contact in a joint diffusion process, inter-correlated through cross-attention. We then leverage this learned contact for guidance during inference to synthesize realistic and coherent HOIs. Extensive evaluation shows that our joint contact-based human-object interaction approach generates realistic and physically plausible sequences, and we show two applications highlighting the capabilities of our method. Conditioned on a given object trajectory, we can generate the corresponding human motion without re-training, demonstrating strong human-object interdependency learning. Our approach is also flexible, and can be applied to static real-world 3D scene scans.
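To make the idea of contact-based guidance at inference concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes that contact is represented as a predicted distance from each body joint to the nearest object point, and that guidance translates the object along the gradient of a weighted squared error between actual and predicted distances. The function name `contact_guidance_step`, its parameters, and the thresholded weighting are all hypothetical choices made for illustration.

```python
import numpy as np


def contact_guidance_step(joints, object_points, contact_dist,
                          step_size=0.5, contact_threshold=0.05):
    """One toy guidance step (hypothetical, for illustration only).

    joints:        (J, 3) body joint positions
    object_points: (N, 3) points sampled on the object surface
    contact_dist:  (J,)   predicted joint-to-object distances
    Returns the translated object points and the loss before the step.
    """
    # Vectors from every joint to every object point, shape (J, N, 3).
    diff = object_points[None, :, :] - joints[:, None, :]
    dist = np.linalg.norm(diff, axis=-1)              # (J, N)

    # Nearest object point for each joint.
    rows = np.arange(joints.shape[0])
    nearest = dist.argmin(axis=1)
    nearest_dist = dist[rows, nearest]                # (J,)
    nearest_vec = diff[rows, nearest]                 # (J, 3)

    # Assumption: only joints predicted to be near the object contribute.
    weights = (contact_dist < contact_threshold).astype(float)
    norm = max(weights.sum(), 1.0)

    # Weighted squared mismatch between actual and predicted distances.
    residual = nearest_dist - contact_dist
    loss = 0.5 * np.sum(weights * residual ** 2) / norm

    # Gradient of the loss w.r.t. a global object translation,
    # holding the nearest-point assignment fixed.
    unit = nearest_vec / np.maximum(nearest_dist, 1e-8)[:, None]
    grad = np.sum((weights * residual)[:, None] * unit, axis=0) / norm

    # Gradient descent step on the object translation.
    return object_points - step_size * grad, loss


# Usage: one joint at the origin predicted to touch an object 1 m away.
joints = np.zeros((1, 3))
obj = np.array([[1.0, 0.0, 0.0]])
pred = np.zeros(1)
moved, loss_before = contact_guidance_step(joints, obj, pred)
_, loss_after = contact_guidance_step(joints, moved, pred)
```

In this toy setting the object moves toward the joint that is predicted to be in contact, so the mismatch shrinks after the step. In a diffusion sampler, a correction of this kind would be applied during denoising rather than as a standalone optimization; how and where it is applied in CG-HOI is described in the full paper, not in this sketch.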
