Efficient Explicit Joint-level Interaction Modeling with Mamba for Text-guided HOI Generation

Guohong Huang, Ling-An Zeng, Zexin Zheng, Shengbo Gu, Wei-Shi Zheng

TL;DR

This work tackles text-guided human-object interaction (HOI) generation by enabling explicit joint-level modeling with high efficiency. It introduces EJIM, a diffusion-based framework that uses a detailed joint-level HOI representation and a stack of Dual-branch HOI Mamba and Condition Injector modules, augmented by Dynamic Interaction Blocks with progressive joint masking. The approach delivers state-of-the-art motion and interaction quality on BEHAVE and OMOMO while reducing inference time to about 5% of previous methods, validated through extensive quantitative and qualitative experiments and ablations. By integrating text semantics and object geometry at the joint level, EJIM enables more accurate, temporally coherent HOI sequences suitable for animation and embodied AI. Limitations include handling multi-object scenes and non-rigid objects, pointing to future work to broaden applicability and robustness.

Abstract

We propose a novel approach for generating text-guided human-object interactions (HOIs) that achieves explicit joint-level interaction modeling in a computationally efficient manner. Previous methods represent the entire human body as a single token, making it difficult to capture fine-grained joint-level interactions and resulting in unrealistic HOIs. However, treating each individual joint as a token would yield over twenty times more tokens, increasing computational overhead. To address these challenges, we introduce an Efficient Explicit Joint-level Interaction Model (EJIM). EJIM features a Dual-branch HOI Mamba that separately and efficiently models spatiotemporal HOI information, as well as a Dual-branch Condition Injector for integrating text semantics and object geometry into human and object motions. Furthermore, we design a Dynamic Interaction Block and a progressive masking mechanism to iteratively filter out irrelevant joints, ensuring accurate and nuanced interaction modeling. Extensive quantitative and qualitative evaluations on public datasets demonstrate that EJIM surpasses previous works by a large margin while using only 5% of the inference time. Code is available at https://github.com/Huanggh531/EJIM.

Paper Structure

This paper contains 29 sections, 5 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Our EJIM can generate realistic 3D human-object interactions guided by text descriptions and object geometry, with colors transitioning from lighter to darker to represent the passage of time.
  • Figure 2: Overview of our EJIM. Our EJIM takes a noisy HOI sequence $\mathbf{x}_t = \{\mathbf{x}_t^o, \mathbf{x}_t^h\}$ as input and generates denoised results $\mathbf{x}_{t-1}$. $\mathbf{x}_t^o$ and $\mathbf{x}_t^h$ are projected into a latent space via linear projections $\text{E}_o$ and $\text{E}_h$, respectively. In each module, a Dual-branch HOI Mamba (DHM) is used to model spatiotemporal information, while a Dual-branch Condition Injector (DCI) injects conditional information. Two Dynamic Interaction Blocks are employed to model interactions, guided by a dynamic interaction mask that is progressively updated in each module to filter out irrelevant joints, enabling more accurate interaction modeling.
  • Figure 3: (a) Illustration of human joints. (b) Our limb division scheme. Here, the virtual foot-ground contact joint is duplicated and assigned to both lower limbs to mitigate foot skating. (c) The Limb-guided scan in our Spatial Mamba reorders joints by limb groupings and inserts learnable tokens to define distinct limbs. (d) The vanilla scan approach for comparison.
  • Figure 4: Our progressive masking mechanism. Initially, all joints are visible. At each Joint-level Interaction Module, we filter out $k$ joints with the lowest attention scores, leading to more accurate interaction modeling.
  • Figure 5: Qualitative comparisons on the BEHAVE dataset. Red boxes highlight issues like mesh penetration, large contact distances, or text inconsistencies. Our approach generates more realistic and plausible human-object interactions. The mesh color darkens over time to represent progress.
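
The progressive masking mechanism described in the Figure 4 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `drop_lowest_k` and `progressive_masking`, the hand-written per-module score lists, and the tie-breaking rule are assumptions made for illustration. In EJIM the scores would presumably come from the attention computed inside each Joint-level Interaction Module rather than from fixed inputs.

```python
from typing import List, Sequence


def drop_lowest_k(visible: List[bool], scores: Sequence[float], k: int) -> List[bool]:
    """Return a new mask in which the k lowest-scoring visible joints are hidden.

    Joints that are already hidden are never reconsidered. Ties are broken by
    joint index (an assumption; the caption does not specify tie handling).
    """
    candidates = [j for j, is_visible in enumerate(visible) if is_visible]
    to_drop = sorted(candidates, key=lambda j: (scores[j], j))[:k]
    new_mask = list(visible)
    for j in to_drop:
        new_mask[j] = False
    return new_mask


def progressive_masking(scores_per_module: Sequence[Sequence[float]], k: int) -> List[List[bool]]:
    """Start with all joints visible and hide k more joints at every module.

    scores_per_module[m][j] is a hypothetical attention score of joint j at
    module m. Returns the mask after each module.
    """
    num_joints = len(scores_per_module[0])
    mask = [True] * num_joints  # initially, all joints are visible
    history = []
    for scores in scores_per_module:
        mask = drop_lowest_k(mask, scores, k)
        history.append(mask)
    return history


# Toy example: 6 joints, 2 modules, k = 2 joints filtered per module.
toy_scores = [
    [0.30, 0.05, 0.20, 0.10, 0.25, 0.10],  # module 1
    [0.40, 0.30, 0.05, 0.20, 0.35, 0.10],  # module 2
]
masks = progressive_masking(toy_scores, k=2)
for module_index, mask in enumerate(masks, start=1):
    visible_joints = [j for j, is_visible in enumerate(mask) if is_visible]
    print(f"after module {module_index}: visible joints = {visible_joints}")
```

In this toy run, module 1 hides joints 1 and 3, and module 2 hides joints 2 and 5, leaving joints 0 and 4 visible. Joint 1 stays hidden at module 2 even though its score there is high, which reflects the progressive nature of the mask: once a joint is filtered out, it does not return.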