UniHOI: Unified Human-Object Interaction Understanding via Unified Token Space
Panqi Yang, Haodong Jing, Nanning Zheng, Yongqiang Ma
TL;DR
UniHOI addresses human-object interaction (HOI) understanding by unifying detection and generation in a single multimodal model with a shared token space, enabling bidirectional mapping between images and interaction semantics. It introduces a modality-aware token space and a symmetric Interaction-Aware Attention module, together with a cycle-consistency-based semi-supervised objective that jointly trains detection and generation across heterogeneous supervision. The approach achieves state-of-the-art results on HOI detection (e.g., 48.16 mAP on HICO-DET Full, 50.74 on Rare, and 51.34 in the Known Object setting) and HOI generation (e.g., Image Reward 1.17, FID 18.2, CLIP 32.46, HOI Score 0.64, IA 0.54), with strong gains on long-tailed and open-vocabulary tasks. By combining a unified tokenization scheme with cross-modal reasoning, UniHOI demonstrates data-efficient, open-world HOI reasoning, with implications for joint perception and generation in multimodal systems, and offers a blueprint for bridging recognition and generation in other domains.
Abstract
In human-object interaction (HOI) research, detection and generation are dual tasks that have traditionally been addressed separately, which hinders the development of comprehensive interaction understanding. To address this, we propose UniHOI, which jointly models HOI detection and generation via a unified token space, promoting knowledge sharing and improving generalization. Specifically, we introduce a symmetric interaction-aware attention module and a unified semi-supervised learning paradigm, enabling effective bidirectional mapping between images and interaction semantics even under limited annotations. Extensive experiments demonstrate that UniHOI achieves state-of-the-art performance in both HOI detection and generation. In particular, UniHOI improves accuracy by 4.9% on long-tailed HOI detection and boosts interaction metrics by 42.0% on open-vocabulary generation tasks.
