Exploring Interactive Semantic Alignment for Efficient HOI Detection with Vision-language Model
Jihao Dong, Renjie Pan, Hua Yang
TL;DR
This work tackles HOI detection by addressing the context gap in two-stage transformers and leveraging CLIP to align visual–textual interactions. It introduces ISA-HOI, a two-stage detector with two modules: Improve interaction Features (IF), which fuses global CLIP context, local ROI cues, and object text embeddings into robust interaction representations, and Verb Semantic Improvement (VSI), which refines verb embeddings via cross-modal fusion. Training uses cosine alignment between interaction features and verb semantics with focal loss, while inference combines detection and verb scores through a geometric mean to produce HOI predictions. On HICO-DET and V-COCO, ISA-HOI delivers competitive or state-of-the-art results, with pronounced gains in zero-shot settings, demonstrating effective cross-modal alignment and efficient training. The approach highlights the practical potential of CLIP-guided interaction semantics for robust, training-efficient HOI understanding, especially for unseen interactions.
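The scoring steps named above (cosine alignment, focal loss, geometric-mean fusion) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the sigmoid with a `temperature` of 0.1, the focal-loss defaults `alpha=0.25` and `gamma=2.0`, and the choice of which scores enter the geometric mean are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def verb_scores(interaction_feature, verb_embeddings, temperature=0.1):
    """Score each verb by the cosine similarity between one interaction
    feature and that verb's text embedding, squashed by a sigmoid.
    The temperature value is an assumption, not taken from the paper."""
    scores = []
    for embedding in verb_embeddings:
        logit = cosine_similarity(interaction_feature, embedding) / temperature
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores


def binary_focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Focal loss for one binary verb label. alpha and gamma are the
    commonly used defaults, assumed here rather than sourced from the paper."""
    p = min(max(p, 1e-7), 1.0 - 1e-7)  # clamp to avoid log(0)
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def geometric_mean(scores, eps=1e-12):
    """Geometric mean of positive scores, computed in log space."""
    log_sum = sum(math.log(max(s, eps)) for s in scores)
    return math.exp(log_sum / len(scores))


def hoi_score(human_det, object_det, verb_score):
    """Fuse detector confidences with a verb score. Which scores enter
    the geometric mean is an assumption of this sketch."""
    return geometric_mean([human_det, object_det, verb_score])


if __name__ == "__main__":
    feature = [1.0, 0.0]
    verbs = [[1.0, 0.0], [0.0, 1.0]]  # two toy verb embeddings
    per_verb = verb_scores(feature, verbs)
    print(per_verb)
    print(hoi_score(0.9, 0.8, per_verb[0]))
```

Taking a geometric rather than arithmetic mean means that a single low score (for example, a weak object detection) pulls the fused HOI score down sharply, so a pair ranks highly only when detection and verb evidence agree.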
Abstract
Human-Object Interaction (HOI) detection aims to localize human-object pairs and comprehend their interactions. Recently, two-stage transformer-based methods have demonstrated competitive performance. However, these methods frequently focus on object appearance features and ignore global contextual information. In addition, the vision-language model CLIP, which effectively aligns visual and text embeddings, has shown great potential in zero-shot HOI detection. Motivated by these observations, we introduce a novel HOI detector named ISA-HOI, which extensively leverages knowledge from CLIP to align interactive semantics between visual and textual features. We first extract the global context of the image and the local features of objects to Improve interaction Features in images (IF). We then propose a Verb Semantic Improvement (VSI) module to enhance the textual features of verb labels via cross-modal fusion. Ultimately, our method achieves competitive results on the HICO-DET and V-COCO benchmarks with far fewer training epochs, and outperforms the state of the art under zero-shot settings.
