Efficient Explicit Joint-level Interaction Modeling with Mamba for Text-guided HOI Generation
Guohong Huang, Ling-An Zeng, Zexin Zheng, Shengbo Gu, Wei-Shi Zheng
TL;DR
This work tackles text-guided human-object interaction (HOI) generation by making explicit joint-level interaction modeling computationally efficient. It introduces EJIM, a diffusion-based framework that uses a detailed joint-level HOI representation and a stack of Dual-branch HOI Mamba and Condition Injector modules, augmented by Dynamic Interaction Blocks with progressive joint masking. The approach delivers state-of-the-art motion and interaction quality on BEHAVE and OMOMO while reducing inference time to about 5% of that of previous methods, as validated through extensive quantitative and qualitative experiments and ablations. By integrating text semantics and object geometry at the joint level, EJIM produces more accurate, temporally coherent HOI sequences suitable for animation and embodied AI. Its limitations include multi-object scenes and non-rigid objects, which are left to future work.
Abstract
We propose a novel approach for generating text-guided human-object interactions (HOIs) that achieves explicit joint-level interaction modeling in a computationally efficient manner. Previous methods represent the entire human body as a single token, making it difficult to capture fine-grained joint-level interactions and resulting in unrealistic HOIs. However, treating each individual joint as a token would yield over twenty times more tokens, increasing computational overhead. To address these challenges, we introduce an Efficient Explicit Joint-level Interaction Model (EJIM). EJIM features a Dual-branch HOI Mamba that separately and efficiently models spatiotemporal HOI information, as well as a Dual-branch Condition Injector for integrating text semantics and object geometry into human and object motions. Furthermore, we design a Dynamic Interaction Block and a progressive masking mechanism to iteratively filter out irrelevant joints, ensuring accurate and nuanced interaction modeling. Extensive quantitative and qualitative evaluations on public datasets demonstrate that EJIM surpasses previous works by a large margin while using only 5% of the inference time. Code is available at https://github.com/Huanggh531/EJIM.
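To make the token-count argument concrete, the following is a rough back-of-the-envelope sketch, not the authors' code. It assumes one human token per frame in the body-level setting and one token per joint per frame in the joint-level setting. The values of 22 joints and 196 frames are illustrative assumptions rather than figures from the paper, and the function names are hypothetical. The cost exponents reflect the general scaling of self-attention (quadratic in sequence length) and of linear-time sequence models such as Mamba (linear in sequence length).

```python
def token_counts(num_frames, num_joints):
    """Return (body_level_tokens, joint_level_tokens) for one motion clip.

    Assumption: body-level modeling uses one human token per frame,
    joint-level modeling uses one token per joint per frame.
    """
    body_level = num_frames
    joint_level = num_frames * num_joints
    return body_level, joint_level


def relative_cost(token_ratio, exponent):
    """Relative sequence-mixing cost when the token count grows by token_ratio.

    exponent=2 models quadratic self-attention,
    exponent=1 models a linear-time sequence model.
    """
    return token_ratio ** exponent


# Illustrative values only (assumed, not taken from the paper).
NUM_FRAMES = 196
NUM_JOINTS = 22

body_tokens, joint_tokens = token_counts(NUM_FRAMES, NUM_JOINTS)
ratio = joint_tokens // body_tokens

print("body-level tokens: ", body_tokens)
print("joint-level tokens:", joint_tokens)
print("token ratio:       ", ratio)
print("quadratic cost growth:", relative_cost(ratio, 2))
print("linear cost growth:   ", relative_cost(ratio, 1))
```

Under these assumptions, the joint-level sequence is 22 times longer. A quadratic-cost mixer would then need roughly 484 times the work of the body-level case, whereas a linear-time mixer would need about 22 times, which is the kind of gap that motivates a Mamba-style design for joint-level tokens.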
