CG-HOI: Contact-Guided 3D Human-Object Interaction Generation

Christian Diller, Angela Dai

TL;DR

CG-HOI generates dynamic 3D human-object interactions (HOIs) from text by jointly modeling full-body human motion, object motion, and body-object contact within a single diffusion framework. The three modalities are denoised together and linked through cross-attention so that the model learns their interdependencies; contact-based weighting and inference-time diffusion guidance then encourage physical plausibility, yielding realistic HOIs from a text description and object geometry. The method is evaluated on BEHAVE and CHAIRS, with two applications (human motion conditioned on a given object trajectory, and populating static 3D scenes) and ablations that isolate the contributions of contact modeling, cross-attention, and inference-time guidance. Because neither application requires re-training, the approach is relevant to interactive 3D animation, robotics planning, and scene understanding.

Abstract

We propose CG-HOI, the first method to address the task of generating dynamic 3D human-object interactions (HOIs) from text. We model the motion of both human and object in an interdependent fashion, as semantically rich human motion rarely happens in isolation without any interactions. Our key insight is that explicitly modeling contact between the human body surface and object geometry can be used as strong proxy guidance, both during training and inference. Using this guidance to bridge human and object motion enables generating more realistic and physically plausible interaction sequences, where the human body and corresponding object move in a coherent manner. Our method first learns to model human motion, object motion, and contact in a joint diffusion process, inter-correlated through cross-attention. We then leverage this learned contact for guidance during inference to synthesize realistic and coherent HOIs. Extensive evaluation shows that our joint contact-based human-object interaction approach generates realistic and physically plausible sequences, and we show two applications highlighting the capabilities of our method. Conditioned on a given object trajectory, we can generate the corresponding human motion without re-training, demonstrating strong human-object interdependency learning. Our approach is also flexible, and can be applied to static real-world 3D scene scans.
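
To make the notion of contact between the body surface and object geometry concrete, the following is a minimal sketch and not the authors' implementation. It assumes contact is measured as the distance from each body vertex to its nearest object point, and it maps those distances to weights with a Gaussian falloff. The function names `contact_distances` and `contact_weights`, the falloff shape, and the `sigma` value are all hypothetical and chosen only for illustration.

```python
import math


def contact_distances(body_vertices, object_points):
    """For each body vertex, return the distance to the nearest object point."""
    return [min(math.dist(v, p) for p in object_points) for v in body_vertices]


def contact_weights(distances, sigma=0.05):
    """Map distances to weights; closer contact gives a larger weight.

    The Gaussian falloff and the sigma value are illustrative assumptions,
    not values taken from the paper.
    """
    return [math.exp(-(d * d) / (2.0 * sigma * sigma)) for d in distances]


# Toy example: two body vertices and two object points (units are arbitrary).
body = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
obj = [(0.0, 0.0, 0.1), (1.0, 1.0, 0.0)]

d = contact_distances(body, obj)  # nearest-object distance per body vertex
w = contact_weights(d)            # larger weight for the closer vertex
```

In this toy setup the first vertex lies much closer to the object than the second, so it receives the larger weight. A weighting of this kind is one plausible way to let the body regions in closest contact dominate the coupling between human and object motion.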

Paper Structure

This paper contains 35 sections, 9 equations, 12 figures, and 5 tables.

Figures (12)

  • Figure 1: We present an approach to generate realistic 3D human-object interactions (HOIs), from a text description and given static object geometry to be interacted with (left). Our main insight is to explicitly model contact (visualized as colors on the body mesh, closer contact in red), in tandem with human and object sequences, in a joint diffusion process. In addition to synthesizing HOIs from text, we can also synthesize human motions conditioned on given object trajectories (top right), and generate interactions in static scene scans (bottom right).
  • Figure 2: Method Overview. Given a text description and object geometry, CG-HOI produces a human-object interaction (HOI) sequence, modeling both human and object motion. To produce realistic HOIs, we additionally model contact to bridge the interdependent motions. Our method jointly generates all three during training (left), using a U-Net-based diffusion with cross-attention across human, object, and contact. During inference (right), we drive synthesis under guidance of estimated contact to sample more physically plausible interactions.
  • Figure 3: An object's trajectory is largely defined by the motion of the region of the body in close contact with the object, e.g. the hand(s) when carrying an object (left, middle) or the lower body when moving with an object while sitting (right). This informs our contact-based approach to generating object motion.
  • Figure 4: Qualitative comparison to state-of-the-art methods MDM [tevet2023human] and InterDiff [xu2023interdiff]. Our approach generates high-quality HOIs by jointly modeling contact (closer contact in red), reducing penetration and floating artifacts (black highlight boxes).
  • Figure 5: Perceptual User Study. Participants significantly favor our method over baselines, for overall realism and text coherence.
  • ...and 7 more figures