ChainHOI: Joint-based Kinematic Chain Modeling for Human-Object Interaction Generation

Ling-An Zeng, Guohong Huang, Yi-Lin Wei, Shengbo Gu, Yu-Ming Tang, Jingke Meng, Wei-Shi Zheng

TL;DR

ChainHOI introduces a dual-level framework for text-driven HOI generation that explicitly models joint-level interactions with a Generative Spatiotemporal Graph Convolution Network (GST-GCN) and kinetic-chain interactions with a Kinematics-based Interaction Module (KIM). By employing a joint graph that includes an object node and a kinetic-chain token mechanism, the model captures both short-/long-term joint relations and inter-joint coordination within biomechanical constraints. Training combines diffusion-based generation with auxiliary losses that penalize penetration and incorrect object motion, yielding semantically coherent and physically plausible HOIs. Evaluations on BEHAVE and OMOMO demonstrate state-of-the-art performance in motion quality and interaction realism, with additional gains from Affordance-guided Interaction Correction (AIC). The approach advances text-driven HOI synthesis by making the physics and geometry of human-object interactions explicit and biomechanically coherent, enabling more controllable and realistic animations for AR/VR, gaming, and film production.

Abstract

We propose ChainHOI, a novel approach for text-driven human-object interaction (HOI) generation that explicitly models interactions at both the joint and kinetic chain levels. Unlike existing methods that implicitly model interactions using full-body poses as tokens, we argue that explicitly modeling joint-level interactions is more natural and effective for generating realistic HOIs, as it directly captures the geometric and semantic relationships between joints, rather than modeling interactions in the latent pose space. To this end, ChainHOI introduces a novel joint graph to capture potential interactions with objects, and a Generative Spatiotemporal Graph Convolution Network to explicitly model interactions at the joint level. Furthermore, we propose a Kinematics-based Interaction Module that explicitly models interactions at the kinetic chain level, ensuring more realistic and biomechanically coherent motions. Evaluations on two public datasets demonstrate that ChainHOI significantly outperforms previous methods, generating more realistic and semantically consistent HOIs. Code is available at https://github.com/qinghuannn/ChainHOI.

Paper Structure

This paper contains 38 sections, 7 equations, 11 figures, 15 tables.

Figures (11)

  • Figure 1: Given a text description and target object geometry, our ChainHOI effectively generates high-quality human-object interaction sequences that are both logical and realistic.
  • Figure 2: Overview of ChainHOI. ChainHOI is a diffusion-based model with $N$ identical blocks. Each block contains a Generative Spatiotemporal GCN (GST-GCN) and a Kinematics-based Interaction Module (KIM) to model interactions at the joint and kinetic chain levels. GST-GCN, comprising an ST-GCN and a Semantic-consistent Module, captures short- and long-term information while ensuring semantic consistency. KIM includes a Context-aware Decoder and a Kinematic-aware Decoder to capture HOI context (textual and object geometry) and to model intra- and inter-kinetic chain interactions. Input and output projection layers are omitted for clarity.
  • Figure 3: Design of the HOI Joint Graph. The object node contains object information and is connected to potential interaction joints. The foot-contact node is added to prevent foot sliding.
  • Figure 4: Semantic-consistent Module and Context-aware Decoder. Both modules share a similar structure but differ in their inputs and objectives. The former models long-term information and ensures semantic consistency, while the latter models context to plan the goals of each kinetic chain.
  • Figure 5: Design of Kinetic Chains. Beyond internal kinetic chains, an additional interaction chain is used to explicitly model the interactions between joints and the object.
  • ...and 6 more figures
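
The HOI Joint Graph described in Figure 3 can be illustrated with a small adjacency-matrix sketch. This is not the authors' implementation: the seven-joint skeleton, the choice of hands as potential interaction joints, and the wiring of the foot-contact node to the foot joints are all assumptions made for illustration, and `build_hoi_adjacency` is a hypothetical helper name.

```python
# Toy skeleton (hypothetical; not the joint set used in the paper).
JOINTS = ["pelvis", "spine", "head", "l_hand", "r_hand", "l_foot", "r_foot"]
BONES = [
    ("spine", "pelvis"), ("head", "spine"),
    ("l_hand", "spine"), ("r_hand", "spine"),
    ("l_foot", "pelvis"), ("r_foot", "pelvis"),
]
INTERACTION_JOINTS = ["l_hand", "r_hand"]  # assumed "potential interaction joints"
FOOT_JOINTS = ["l_foot", "r_foot"]         # assumed neighbours of the foot-contact node


def build_hoi_adjacency(joints, bones, interaction_joints, foot_joints):
    """Return (node names, symmetric 0/1 adjacency matrix with self-loops)."""
    # Two extra nodes are appended after the body joints.
    nodes = list(joints) + ["object", "foot_contact"]
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    adj = [[0] * n for _ in range(n)]

    def connect(a, b):
        i, j = index[a], index[b]
        adj[i][j] = 1
        adj[j][i] = 1

    for i in range(n):
        adj[i][i] = 1  # self-loops, as is common for graph convolutions
    for child, parent in bones:
        connect(child, parent)  # skeletal edges
    for joint in interaction_joints:
        connect("object", joint)  # object node <-> candidate contact joints
    for joint in foot_joints:
        connect("foot_contact", joint)  # assumed wiring of the foot-contact node
    return nodes, adj


nodes, adj = build_hoi_adjacency(JOINTS, BONES, INTERACTION_JOINTS, FOOT_JOINTS)
```

A graph convolution over this matrix would let object information reach the hands in one hop and the rest of the body through the skeletal edges, which is the intuition behind adding an explicit object node.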