Decoupled Generative Modeling for Human-Object Interaction Synthesis

Hwanhee Jung, Seunggwan Lee, Jeongyoon Yoon, SeungHyeon Kim, Giljoo Nam, Qixing Huang, Sangpil Kim

TL;DR

DecHOI synthesizes realistic human-object interactions by decoupling path planning from action generation: trajectories are generated without manually specified waypoints, and detailed motion is then conditioned on those generated paths. It combines a diffusion-based trajectory generator with a separate action generator, and adds a distal-joint adversarial discriminator and a dynamic planner (DynaPlan) for long-horizon, scene-aware planning in dynamic environments. On FullBodyManipulation and on unseen 3D-FUTURE objects, it surpasses prior methods on most quantitative metrics and in qualitative comparisons, and user studies prefer its results for text alignment and interaction realism. The approach reduces optimization complexity, improves contact realism, and supports reactive planning in multi-agent scenarios, which makes HOI synthesis more practical for 3D vision and robotics applications.

Abstract

Synthesizing realistic human-object interaction (HOI) is essential for 3D computer vision and robotics, underpinning animation and embodied control. Existing approaches often require manually specified intermediate waypoints and place all optimization objectives on a single network, which increases complexity, reduces flexibility, and leads to errors such as unsynchronized human and object motion or penetration. To address these issues, we propose Decoupled Generative Modeling for Human-Object Interaction Synthesis (DecHOI), which separates path planning and action synthesis. A trajectory generator first produces human and object trajectories without prescribed waypoints, and an action generator conditions on these paths to synthesize detailed motions. To further improve contact realism, we employ adversarial training with a discriminator that focuses on the dynamics of distal joints. The framework also models a moving counterpart and supports responsive, long-sequence planning in dynamic scenes, while preserving plan consistency. Across two benchmarks, FullBodyManipulation and 3D-FUTURE, DecHOI surpasses prior methods on most quantitative metrics and qualitative evaluations, and perceptual studies likewise prefer our results.
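The two-stage decoupling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `plan_trajectory`, `synthesize_action`, `generate_interaction`, `NUM_JOINTS`, and `HAND_INDEX` are hypothetical names, linear interpolation stands in for the learned diffusion-based trajectory generator, and random offsets stand in for the learned action generator. The sketch only shows the data flow, in which paths are produced first and joint motion is then conditioned on them.

```python
import numpy as np

NUM_JOINTS = 24  # assumption: SMPL-like skeleton size, not taken from the paper
HAND_INDEX = 23  # assumption: index of one hand joint in that skeleton


def plan_trajectory(start, goal, num_frames):
    """Stage 1 stand-in: return a (num_frames, 3) path from start to goal.

    Linear interpolation replaces the learned trajectory generator here.
    """
    start = np.asarray(start, dtype=float)
    goal = np.asarray(goal, dtype=float)
    weights = np.linspace(0.0, 1.0, num_frames)[:, None]
    return (1.0 - weights) * start + weights * goal


def synthesize_action(human_path, object_path, rng):
    """Stage 2 stand-in: return (num_frames, NUM_JOINTS, 3) joint positions.

    Joints are placed around the planned human path; one hand is pinned to
    the object path as a crude placeholder for contact.
    """
    num_frames = human_path.shape[0]
    offsets = 0.1 * rng.standard_normal((num_frames, NUM_JOINTS, 3))
    offsets[:, 0, :] = 0.0  # joint 0 is the root and stays on the planned path
    joints = human_path[:, None, :] + offsets
    joints[:, HAND_INDEX, :] = object_path  # hand follows the object
    return joints


def generate_interaction(human_start, object_start, goal, num_frames=60, seed=0):
    """Run both stages: plan paths first, then condition motion on them."""
    rng = np.random.default_rng(seed)
    human_path = plan_trajectory(human_start, goal, num_frames)
    object_path = plan_trajectory(object_start, goal, num_frames)
    joints = synthesize_action(human_path, object_path, rng)
    return human_path, object_path, joints
```

Because the second stage only consumes the paths, either stage could be retrained or replaced without touching the other, which is the flexibility the decoupled design aims for.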

Paper Structure

This paper contains 27 sections, 10 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Overview of DecHOI for dynamic human-object interaction synthesis. The framework decouples trajectory planning and interaction synthesis, enabling collision detection and responsive re-planning for realistic, contact-consistent motion.
  • Figure 2: Architecture of DecHOI showing the decoupled trajectory and action generation process. Conditioned on the text instruction, geometry, current human and object poses, and a goal point, the trajectory generator plans paths, while the action generator produces joint motions on these paths to yield synchronized, contact-aware interactions. The right panels detail the Trajectory and Action Generators.
  • Figure 3: Adversarial module of DecHOI, where a hand- and foot-focused discriminator contrasts real and generated interactions to enhance contact realism.
  • Figure 4: Qualitative comparison of DecHOI with CHOIS [li2024controllable] and HOIFHLI [wu2025human] on the FullBodyManipulation dataset [li2023object]. DecHOI produces stable contacts, smooth motion, and accurate object trajectories, while prior methods show drift, penetration, or inconsistent coordination between human and object motions.
  • Figure 5: Qualitative comparison of DecHOI and CHOIS [li2024controllable] on the 3D-FUTURE dataset [fu20213d], showing generalization to unseen objects.
  • ...and 3 more figures
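
To illustrate what a hand- and foot-focused discriminator (Figure 3) might take as input, the following sketch extracts positions and finite-difference velocities for the distal joints only. It is an assumption-laden illustration, not the paper's feature definition: the joint indices follow the common 24-joint SMPL ordering, and `DISTAL_JOINTS`, `distal_dynamics`, and the `fps` default are hypothetical.

```python
import numpy as np

# Assumption: indices from the common 24-joint SMPL ordering; the paper's
# actual skeleton and joint selection may differ.
DISTAL_JOINTS = {"left_foot": 10, "right_foot": 11, "left_hand": 22, "right_hand": 23}


def distal_dynamics(joints, fps=30.0):
    """Return per-frame position and velocity features for distal joints.

    joints: array of shape (num_frames, num_joints, 3).
    Output: array of shape (num_frames - 1, 4, 6), position then velocity.
    """
    indices = list(DISTAL_JOINTS.values())
    distal = joints[:, indices, :]               # (T, 4, 3)
    velocity = np.diff(distal, axis=0) * fps     # (T - 1, 4, 3)
    return np.concatenate([distal[1:], velocity], axis=-1)
```

A discriminator fed these features would see only hand and foot motion, so artifacts such as sliding feet or jittering hands would not be diluted by the rest of the body.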