Exploring Interactive Semantic Alignment for Efficient HOI Detection with Vision-language Model

Jihao Dong, Renjie Pan, Hua Yang

TL;DR

This work tackles HOI detection by addressing the context gap in two-stage transformers and leveraging CLIP to align visual–textual interactions. It introduces ISA-HOI, a two-stage detector with two modules: Improved Features (IF) to fuse global CLIP context, local ROI cues, and object text embeddings for robust interaction representations, and Verb Semantic Improvement (VSI) to refine verb embeddings via cross-modal fusion. Training uses cosine alignment between interaction features and verb semantics with focal loss, while inference combines detection and verb scores through a geometric mean to produce HOI predictions. Across HICO-DET and V-COCO, ISA-HOI delivers competitive or state-of-the-art results, with pronounced gains in zero-shot settings, demonstrating effective cross-modal alignment and efficient training. The approach highlights the practical potential of CLIP-guided interaction semantics for robust, data-efficient HOI understanding, especially for unseen interactions.
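The training and inference scoring summarized above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The function names, the `scale` temperature, the focal-loss hyperparameters (`alpha`, `gamma`), and the choice to take the geometric mean over human confidence, object confidence, and verb probability are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def verb_probabilities(interaction_feature, verb_embeddings, scale=10.0):
    """Score one interaction feature against every verb embedding.

    `scale` is a hypothetical temperature applied to the cosine
    similarity before the sigmoid; the summary does not specify one.
    """
    return [
        sigmoid(scale * cosine_similarity(interaction_feature, emb))
        for emb in verb_embeddings
    ]


def binary_focal_loss(prob, target, alpha=0.25, gamma=2.0):
    """Standard binary focal loss for one verb probability.

    alpha and gamma are commonly used defaults, assumed here.
    """
    prob = min(max(prob, 1e-7), 1.0 - 1e-7)  # avoid log(0)
    p_t = prob if target == 1 else 1.0 - prob
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def hoi_score(human_conf, object_conf, verb_prob):
    """Geometric mean of the three scores (assumed fusion rule)."""
    return (human_conf * object_conf * verb_prob) ** (1.0 / 3.0)


# Toy usage: a 2-D interaction feature and two verb embeddings.
feature = [1.0, 0.0]
verbs = [[1.0, 0.0], [0.0, 1.0]]
probs = verb_probabilities(feature, verbs)
scores = [hoi_score(0.9, 0.8, p) for p in probs]
```

In this toy example the verb whose embedding points in the same direction as the interaction feature receives the higher probability, and therefore the higher fused HOI score. A geometric mean also keeps the final score low whenever any single factor is low, which is one plausible reason to prefer it over an arithmetic mean.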

Abstract

Human-Object Interaction (HOI) detection aims to localize human-object pairs and comprehend their interactions. Recently, two-stage transformer-based methods have demonstrated competitive performance. However, these methods frequently focus on object appearance features and ignore global contextual information. Meanwhile, the vision-language model CLIP, which effectively aligns visual and text embeddings, has shown great potential in zero-shot HOI detection. Based on these observations, we introduce a novel HOI detector named ISA-HOI, which extensively leverages knowledge from CLIP, aligning interactive semantics between visual and textual features. We first extract the global context of the image and the local features of objects to Improve interaction Features in images (IF). In addition, we propose a Verb Semantic Improvement (VSI) module to enhance the textual features of verb labels via cross-modal fusion. Ultimately, our method achieves competitive results on the HICO-DET and V-COCO benchmarks with far fewer training epochs, and outperforms the state of the art under zero-shot settings.

Paper Structure (13 sections, 6 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Comparison of our architecture with existing work. (a) A recent one-stage method with CLIP directly initializes the classifier with the labels' text embeddings. (b) A common two-stage method relies solely on object features for interaction recognition. (c) Our proposed ISA-HOI improves both the interaction features and the text embeddings of the verb category labels for alignment. Differences are highlighted in bold.
  • Figure 2: Overview of our proposed method. The framework comprises two stages: (a) human-object detection utilizing a pre-trained DETR, and (b) subsequent interaction recognition employing our IF and VSI modules (see the IF and VSI subsections).
  • Figure 3: The composition of interaction queries.
  • Figure 4: Visualization of predictions. The first column displays the predicted results of ISA-HOI$_{s}$, while the second and third columns showcase attention maps of various interactions. All images are sampled from the HICO-DET test set.
  • Figure 5: Visualization of the efficiency comparison. The size of each point is directly proportional to the relative training GPU time annotated alongside it.