Controllable Human-Object Interaction Synthesis

Jiaman Li, Alexander Clegg, Roozbeh Mottaghi, Jiajun Wu, Xavier Puig, C. Karen Liu

TL;DR

CHOIS tackles the problem of language-guided, long-horizon human–object interaction synthesis in 3D scenes by jointly generating synchronized object and human motion using a conditional diffusion model. It introduces an object geometry loss during training and multiple guidance terms during sampling to enforce realistic hand–object contact and grounding to sparse object waypoints, enabling integration with path planning for extended interactions. Evaluations on FullBodyManipulation and 3D-FUTURE show improved condition matching and interaction quality, with ablations confirming the benefits of geometry supervision and guidance terms. The approach supports long-term, scene-aware interaction generation and offers a practical pipeline for animation, robotics, and embodied AI tasks.

Abstract

Synthesizing semantic-aware, long-horizon, human-object interaction is critical to simulate realistic human behaviors. In this work, we address the challenging problem of generating synchronized object motion and human motion guided by language descriptions in 3D scenes. We propose Controllable Human-Object Interaction Synthesis (CHOIS), an approach that generates object motion and human motion simultaneously using a conditional diffusion model given a language description, initial object and human states, and sparse object waypoints. Here, language descriptions inform style and intent, and waypoints, which can be effectively extracted from high-level planning, ground the motion in the scene. Naively applying a diffusion model fails to predict object motion aligned with the input waypoints; it also cannot ensure the realism of interactions that require precise hand-object and human-floor contact. To overcome these problems, we introduce an object geometry loss as additional supervision to improve the matching between generated object motion and input object waypoints; we also design guidance terms to enforce contact constraints during the sampling process of the trained diffusion model. We demonstrate that our learned interaction module can synthesize realistic human-object interactions, adhering to provided textual descriptions and sparse waypoint conditions. Additionally, our module seamlessly integrates with a path planning module, enabling the generation of long-term interactions in 3D environments.
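
The idea of steering sampling with guidance terms can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a single hypothetical contact term (squared distance from a hand position to its nearest object point on frames flagged as in contact), computes its gradient analytically, and nudges the prediction against that gradient, as one might do between denoising steps. The function names, array shapes, and step size are illustrative assumptions.

```python
import numpy as np

def contact_loss_and_grad(hand_pos, obj_points, in_contact):
    """Toy contact term (an assumption, not the paper's exact loss).

    hand_pos:   (T, 3) predicted hand positions
    obj_points: (T, K, 3) object surface points per frame
    in_contact: (T,) boolean mask of frames where contact is expected
    Returns the scalar loss and its gradient w.r.t. hand_pos, treating the
    nearest-point assignment as fixed.
    """
    diff = hand_pos[:, None, :] - obj_points            # (T, K, 3)
    dist2 = np.sum(diff ** 2, axis=-1)                  # (T, K)
    nearest = np.argmin(dist2, axis=1)                  # (T,)
    frames = np.arange(hand_pos.shape[0])
    # Offset to the nearest object point, zeroed on non-contact frames.
    d = diff[frames, nearest] * in_contact[:, None]     # (T, 3)
    loss = float(np.sum(d ** 2))
    grad = 2.0 * d
    return loss, grad

def guided_update(hand_pos, obj_points, in_contact, step_size=0.25):
    """Perturb the prediction one step against the contact gradient."""
    _, grad = contact_loss_and_grad(hand_pos, obj_points, in_contact)
    return hand_pos - step_size * grad

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hand = rng.normal(size=(5, 3))
    obj = rng.normal(size=(5, 8, 3))
    mask = np.array([True, False, True, True, False])
    before, _ = contact_loss_and_grad(hand, obj, mask)
    after, _ = contact_loss_and_grad(guided_update(hand, obj, mask), obj, mask)
    print(f"contact loss before: {before:.4f}, after: {after:.4f}")
```

In a real sampler, a perturbation of this kind would be applied to the model's prediction at selected denoising steps, and several terms (for example, hand-object and human-floor contact) would be weighted and summed. Those details are not specified in this summary.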

Paper Structure

This paper contains 13 sections, 9 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Given an initial object and human state, a language description, and sparse object waypoints in a 3D scene, CHOIS generates synchronized object motion and human motion at the same time.
  • Figure 2: Method Overview. Given an object geometry, we use the BPS representation to encode the geometry and an MLP to project the features into a low-dimensional vector. This feature vector is concatenated with masked pose states to form conditions for the denoising network. During sampling, we use analytical functions to compute gradients and perturb the generation to satisfy our defined constraints.
  • Figure 3: Qualitative results on the FullBodyManipulation dataset (Li et al., 2023).
  • Figure 4: Results of human perceptual studies. The numbers in the chart are the percentages (%) of motion preferences.
  • Figure 5: Long-term interaction synthesis. Given language descriptions, a 3D scene with semantic labels, and initial human and object states, we synthesize long-term human-object interactions. The initial state is shown in green.
  • ...and 1 more figure
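
The conditioning pipeline described in the Figure 2 caption (BPS encoding of the object geometry, an MLP projection, then concatenation with masked pose states) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the number of basis points, the feature sizes, the random MLP weights, and the helper names are all hypothetical, and the BPS variant shown stores the offset from each basis point to its nearest object point.

```python
import numpy as np

def sample_basis_points(n_basis, radius=1.0, seed=0):
    """Fixed basis points sampled uniformly inside a ball (assumed layout)."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_basis, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    r = radius * rng.uniform(size=(n_basis, 1)) ** (1.0 / 3.0)
    return dirs * r

def bps_encode(obj_points, basis):
    """For each basis point, the offset to its nearest object point."""
    diff = obj_points[None, :, :] - basis[:, None, :]    # (B, N, 3)
    dist = np.linalg.norm(diff, axis=-1)                 # (B, N)
    nearest = np.argmin(dist, axis=1)                    # (B,)
    return diff[np.arange(basis.shape[0]), nearest]      # (B, 3)

def mlp_project(x, w1, b1, w2, b2):
    """Two-layer MLP with ReLU that maps BPS features to a small vector."""
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2

def build_condition(obj_feat, pose_states, keep_mask):
    """Concatenate the object feature with masked per-frame pose states.

    keep_mask is 1 for frames whose state is given (e.g. the initial frame
    or a waypoint frame) and 0 elsewhere; masked frames are zeroed.
    """
    n_frames = pose_states.shape[0]
    masked = pose_states * keep_mask[:, None]
    tiled = np.tile(obj_feat[None, :], (n_frames, 1))
    return np.concatenate([tiled, masked], axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    points = rng.uniform(-0.5, 0.5, size=(200, 3))       # stand-in object point cloud
    basis = sample_basis_points(64)
    bps = bps_encode(points, basis).reshape(-1)          # (192,)
    w1, b1 = rng.normal(size=(192, 32)), np.zeros(32)
    w2, b2 = rng.normal(size=(32, 16)), np.zeros(16)
    feat = mlp_project(bps, w1, b1, w2, b2)              # (16,)
    poses = rng.normal(size=(10, 12))
    keep = np.zeros(10)
    keep[[0, 4, 9]] = 1.0                                # initial frame plus two waypoint frames
    print(build_condition(feat, poses, keep).shape)
```

Tiling the same object feature across frames is one simple way to attach a per-sequence geometry code to per-frame inputs. The actual network may combine them differently.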