UniHOI: Unified Human-Object Interaction Understanding via Unified Token Space

Panqi Yang, Haodong Jing, Nanning Zheng, Yongqiang Ma

TL;DR

UniHOI addresses HOI understanding by unifying detection and generation in a single multimodal model with a shared token space, enabling bidirectional mapping between images and interaction semantics. It introduces a modality-aware token space and a symmetric Interaction-Aware Attention (IAA) module, together with a cycle-consistency-based semi-supervised objective that jointly trains detection and generation across heterogeneous supervision. The approach reports state-of-the-art results on HOI detection (e.g., 48.16 mAP on HICO-DET Full, 50.74 on Rare, 51.34 Known Object) and HOI generation (e.g., Image Reward 1.17, FID 18.2, CLIP 32.46, HOI Score 0.64, IA 0.54), including strong gains on long-tailed and open-vocabulary tasks. By combining a unified tokenization scheme with cross-modal reasoning, UniHOI demonstrates data-efficient, open-world HOI reasoning and offers a blueprint for bridging recognition and generation in other domains.

Abstract

In the field of human-object interaction (HOI), detection and generation are dual tasks that have traditionally been addressed separately, hindering the development of comprehensive interaction understanding. To address this, we propose UniHOI, which jointly models HOI detection and generation via a unified token space, thereby promoting knowledge sharing and enhancing generalization. Specifically, we introduce a symmetric interaction-aware attention module and a unified semi-supervised learning paradigm, enabling effective bidirectional mapping between images and interaction semantics even under limited annotations. Extensive experiments demonstrate that UniHOI achieves state-of-the-art performance in both HOI detection and generation. In particular, UniHOI improves accuracy by 4.9% on long-tailed HOI detection and boosts interaction metrics by 42.0% on open-vocabulary generation tasks.
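
The bidirectional mapping and the cycle-consistency idea mentioned in the TL;DR can be illustrated with a toy sketch. This is not the paper's implementation: the function names (`token_mismatch`, `cycle_losses`, `toy_detect`, `toy_generate`, `lossy_detect`), the integer token IDs, and the mismatch-rate "loss" are all hypothetical stand-ins chosen for illustration. A real model would use learned networks and a differentiable objective. The sketch only shows the shape of the idea: mapping image tokens to interaction tokens and back (and the reverse) should reproduce the input, so the round-trip error can act as a training signal even when no labels are available.

```python
from typing import Callable, List, Sequence, Tuple

Tokens = Sequence[int]
Mapper = Callable[[Tokens], List[int]]


def token_mismatch(a: Tokens, b: Tokens) -> float:
    """Fraction of positions where two equal-length token sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cycle_losses(
    image_tokens: Tokens,
    text_tokens: Tokens,
    detect: Mapper,
    generate: Mapper,
) -> Tuple[float, float]:
    """Return (image-cycle error, text-cycle error) for one sample."""
    # image -> interaction text -> image
    image_cycle = token_mismatch(image_tokens, generate(detect(image_tokens)))
    # interaction text -> image -> interaction text
    text_cycle = token_mismatch(text_tokens, detect(generate(text_tokens)))
    return image_cycle, text_cycle


# Hypothetical toy token space: text IDs start at 0, visual IDs at 100.
def toy_detect(image_tokens: Tokens) -> List[int]:
    return [t - 100 for t in image_tokens]


def toy_generate(text_tokens: Tokens) -> List[int]:
    return [t + 100 for t in text_tokens]


def lossy_detect(image_tokens: Tokens) -> List[int]:
    # Collapses neighbouring IDs, so the round trip loses information.
    return [(t - 100) // 2 * 2 for t in image_tokens]


image = [100, 101, 102, 103]
text = [0, 1, 2, 3]
print(cycle_losses(image, text, toy_detect, toy_generate))    # (0.0, 0.0)
print(cycle_losses(image, text, lossy_detect, toy_generate))  # (0.5, 0.5)
```

When the two mappings are exact inverses, both cycle errors are zero. A detector that discards information makes both errors non-zero, which is the signal a cycle-consistency objective penalizes.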

Paper Structure

This paper contains 18 sections, 8 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: UniHOI is the first to achieve unified modeling for the two inverse tasks of HOI detection and generation. Through a unified token space, our method enables generalizable interaction semantics understanding and cross-task knowledge sharing. UniHOI achieves state-of-the-art results on most metrics for both HOI detection and generation. Here, HICO-D refers to the Rare metric of the Default split in the HICO-DET dataset; other abbreviations follow similarly.
  • Figure 2: An overview of the UniHOI pipeline. The bottom-right shows the details of the IAA module, illustrating the bidirectional transformation between text tokens and visual tokens.
  • Figure 3: Visualization of interaction-aware attention maps produced by IAA for HOI detection and generation tasks. Bidirectional arrows indicate the mutual mapping between visual and textual tokens, highlighting IAA's capability in cross-modal interactive semantic modeling. The transitions among prompts, images, and HOI triplets further demonstrate unified token transformations across inverse tasks.
  • Figure 4: Qualitative results of UniHOI. For HOI detection, UniHOI demonstrates enhanced fine-grained interaction understanding; for HOI generation, it produces detailed interactive scenes, including realistic hand poses and precise tool usage.