HOI-Dyn: Learning Interaction Dynamics for Human-Object Motion Diffusion

Lin Wu, Zhixiang Chen, Jianglin Lan

TL;DR

This work reframes human–object interaction (HOI) generation as a driver–responder problem in which human actions drive object responses. It introduces HOI-Dyn, which pairs a lightweight transformer-based interaction dynamics model with a residual-based dynamics loss to enforce causal object reactions during training; the dynamics model is not needed at inference, so inference stays efficient. A conditional diffusion backbone jointly models the human, the object, and the interaction context, and is trained with the auxiliary dynamics loss and a horizon extension intended to capture interactions of varying magnitude. Experiments on FullBodyManipulation and 3D-FUTURE report state-of-the-art performance across multiple metrics, along with applications in 3D scenes and a dynamics-based metric for evaluating causal consistency. The reported gains are in physical plausibility, temporal coherence, and contact realism, with practical implications for VR/AR, animation, and robotics. The work also outlines future directions in richer object representations and multi-agent scalability.

Abstract

Generating realistic 3D human-object interactions (HOIs) remains a challenging task due to the difficulty of modeling detailed interaction dynamics. Existing methods treat human and object motions independently, resulting in physically implausible and causally inconsistent behaviors. In this work, we present HOI-Dyn, a novel framework that formulates HOI generation as a driver-responder system, where human actions drive object responses. At the core of our method is a lightweight transformer-based interaction dynamics model that explicitly predicts how objects should react to human motion. To further enforce consistency, we introduce a residual-based dynamics loss that mitigates the impact of dynamics prediction errors and prevents misleading optimization signals. The dynamics model is used only during training, preserving inference efficiency. Through extensive qualitative and quantitative experiments, we demonstrate that our approach not only enhances the quality of HOI generation but also establishes a feasible metric for evaluating the quality of generated interactions.
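The abstract names a residual-based dynamics loss but does not give its formula. The sketch below shows one plausible reading, offered as an assumption rather than the authors' definition: compute the residual between the observed object displacement and the dynamics model's prediction on both the generated and the ground-truth sequences, then penalize the difference between the two residuals, so that any error the dynamics model makes on ground truth is subtracted out. The names `frame_deltas`, `dynamics_residuals`, and `residual_dynamics_loss`, the `(human_delta, object_pose)` call signature, and the use of plain frame-to-frame differences are all hypothetical; the interaction context input is omitted for brevity.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical signature: (human_delta, object_pose) -> predicted object_delta.
Dynamics = Callable[[Vector, Vector], Vector]


def frame_deltas(seq: Sequence[Vector]) -> List[Vector]:
    """Frame-to-frame differences x_{t+1} - x_t (length T-1 for T frames)."""
    return [[b - a for a, b in zip(cur, nxt)] for cur, nxt in zip(seq, seq[1:])]


def dynamics_residuals(
    human: Sequence[Vector], obj: Sequence[Vector], dynamics: Dynamics
) -> List[Vector]:
    """Observed object delta minus the dynamics model's predicted delta, per frame."""
    residuals = []
    # zip stops at T-1 items, so `pose` is the object pose at frame t.
    for dh, do, pose in zip(frame_deltas(human), frame_deltas(obj), obj):
        predicted = dynamics(dh, pose)
        residuals.append([x - p for x, p in zip(do, predicted)])
    return residuals


def residual_dynamics_loss(
    gen_human: Sequence[Vector],
    gen_obj: Sequence[Vector],
    gt_human: Sequence[Vector],
    gt_obj: Sequence[Vector],
    dynamics: Dynamics,
) -> float:
    """Mean squared gap between generated and ground-truth residuals (assumed form)."""
    r_gen = dynamics_residuals(gen_human, gen_obj, dynamics)
    r_gt = dynamics_residuals(gt_human, gt_obj, dynamics)
    if not r_gen or len(r_gen) != len(r_gt):
        raise ValueError("sequences must have equal length of at least 2 frames")
    total = sum(
        (a - b) ** 2 for rg, rt in zip(r_gen, r_gt) for a, b in zip(rg, rt)
    )
    return total / len(r_gen)


if __name__ == "__main__":
    # Toy 1-D example: the object should move exactly as the hand moves.
    human = [[0.0], [1.0], [2.0]]
    obj_gt = [[5.0], [6.0], [7.0]]
    obj_static = [[5.0], [5.0], [5.0]]  # object ignores the human

    def biased(dh: Vector, pose: Vector) -> Vector:
        # A deliberately imperfect dynamics model with a constant offset.
        return [d + 0.5 for d in dh]

    print(residual_dynamics_loss(human, obj_gt, human, obj_gt, biased))
    print(residual_dynamics_loss(human, obj_static, human, obj_gt, biased))
```

In this toy case the constant bias of the dynamics model appears in both residuals and cancels, so a generated sequence that matches the ground truth is not penalized for the model's own error, while an object that fails to respond to the human still is.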

Paper Structure

This paper contains 54 sections, 38 equations, 15 figures, 9 tables.

Figures (15)

  • Figure 1: Overview of the proposed HOI-Dyn framework. (a) Conditional Motion Diffusion synthesizes human-object interactions $\hat{\tau}_0 = \{\hat{H}, \hat{O}, \hat{X}\}$ using a Transformer-based diffusion model, where $\hat{H} := \{\hat{h}_t\}_{t=0}^{T-1}$ and $\hat{O} := \{\hat{o}_t\}_{t=0}^{T-1}$. (b) The full framework integrates motion generation with interaction dynamics supervision. (c) Interaction Dynamics models object responses $\Delta \hat{o}_t^*$ based on human relative motion $\Delta \hat{h}_t$, object pose $\hat{o}_t$, and interaction context $\hat{s}_t$.
  • Figure 2: Comparison of HOI-Dyn and CHOIS on physical plausibility and sequence-level coherence. (a–b) CHOIS produces premature object motion lacking causal timing; (c) HOI-Dyn generates more realistic post-contact responses; (d) HOI-Dyn maintains consistent human-object interaction across the full sequence. Green markers indicate object initial state and sparse waypoints.
  • Figure 3: HOI generation in realistic 3D scenes. The virtual agent interacts with different objects while maintaining physical plausibility and environmental consistency.
  • Figure 4: Effect of Horizon $K$.
  • Figure 5: Object Loss via Dynamics.
  • ...and 10 more figures
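
The Figure 1 caption says the interaction dynamics model predicts an object response $\Delta \hat{o}_t^*$ from human relative motion $\Delta \hat{h}_t$, object pose $\hat{o}_t$, and interaction context $\hat{s}_t$. As a rough illustration of how per-frame inputs and targets for such a model might be assembled from a sequence, here is a minimal sketch. It assumes that "relative motion" means the frame-to-frame displacement $h_{t+1} - h_t$, which the caption does not define, and `DynamicsSample` and `build_dynamics_samples` are hypothetical names.

```python
from typing import List, NamedTuple, Sequence

Vector = List[float]


class DynamicsSample(NamedTuple):
    human_delta: Vector   # assumed: h_{t+1} - h_t
    object_pose: Vector   # o_t
    context: Vector       # s_t
    object_delta: Vector  # target response: o_{t+1} - o_t


def build_dynamics_samples(
    human: Sequence[Vector],
    obj: Sequence[Vector],
    context: Sequence[Vector],
) -> List[DynamicsSample]:
    """Turn T-frame sequences into T-1 (input, target) samples."""
    if not (len(human) == len(obj) == len(context)):
        raise ValueError("human, object, and context must share a length")
    samples = []
    for t in range(len(human) - 1):
        samples.append(
            DynamicsSample(
                human_delta=[b - a for a, b in zip(human[t], human[t + 1])],
                object_pose=list(obj[t]),
                context=list(context[t]),
                object_delta=[b - a for a, b in zip(obj[t], obj[t + 1])],
            )
        )
    return samples


if __name__ == "__main__":
    # Toy 1-D sequence: the object stays put until contact (context flag 1.0).
    human = [[0.0], [1.0], [3.0]]
    obj = [[5.0], [5.0], [6.0]]
    context = [[0.0], [1.0], [1.0]]
    for sample in build_dynamics_samples(human, obj, context):
        print(sample)
```

Pairing each input with the next-frame object displacement keeps the driver-responder ordering explicit: the human motion and context at frame $t$ are the inputs, and the object's change is the quantity to be explained.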