Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection
Chanhyeong Yang, Taehoon Song, Jihwan Park, Hyunwoo J. Kim
TL;DR
This work introduces VDRP, a prompt-learning framework for zero-shot HOI detection that tackles intra-class visual diversity and inter-class visual entanglement in two ways: (1) visual diversity-aware prompts (VDP), which inject group-wise visual variance and Gaussian perturbations into verb prompts, and (2) region-aware prompts (RAP), which augment those prompts with region-specific concepts retrieved from the human, object, and union regions. The approach follows a two-stage HOI pipeline with a frozen detector and a CLIP-based backbone: it extracts region features, computes verb logits with region-conditioned prompts, and averages those logits to yield HOI predictions. Experiments on HICO-DET across four zero-shot settings show state-of-the-art performance; ablations confirm that VDP and RAP are complementary, and qualitative results illustrate interpretable region-wise concept retrieval. The authors also report strong generalization, parameter efficiency, and scalability to stronger backbones, highlighting the value of distributional and region-grounded prompt learning for zero-shot HOI understanding.
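The averaging of region-conditioned verb logits described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names (`cosine`, `verb_logits`), the use of cosine similarity as the scoring function, the `scale` parameter, and the toy 2-D vectors are all assumptions made for illustration. Only the three regions (human, object, union) and the averaging step come from the summary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def verb_logits(region_feats, region_prompts, scale=1.0):
    """Average per-region verb scores into one logit per verb.

    region_feats:   {region_name: feature_vector}
    region_prompts: {region_name: [prompt_embedding_for_each_verb]}
    Both layouts are hypothetical; the scoring function (scaled cosine
    similarity) is an assumption, not taken from the paper.
    """
    regions = list(region_feats)
    num_verbs = len(region_prompts[regions[0]])
    totals = [0.0] * num_verbs
    for region in regions:
        for k, prompt in enumerate(region_prompts[region]):
            totals[k] += scale * cosine(region_feats[region], prompt)
    # Average over the regions (human, object, union).
    return [t / len(regions) for t in totals]


# Toy example: two verbs, 2-D features for the three regions.
feats = {"human": [1.0, 0.0], "object": [0.0, 1.0], "union": [1.0, 1.0]}
prompts = {
    "human": [[1.0, 0.0], [0.0, 1.0]],
    "object": [[1.0, 0.0], [0.0, 1.0]],
    "union": [[1.0, 1.0], [1.0, -1.0]],
}
logits = verb_logits(feats, prompts)
predicted_verb = max(range(len(logits)), key=logits.__getitem__)
```

In this toy setup the first verb scores highest because two of the three regions agree with its prompts. A real system would use CLIP features and learned prompt embeddings in place of the hand-written vectors.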
Abstract
Zero-shot Human-Object Interaction (HOI) detection aims to localize humans and objects in an image and recognize their interaction, even when specific verb-object pairs are unseen during training. Recent works have shown promising results using prompt learning with pretrained vision-language models such as CLIP, which align natural language prompts with visual features in a shared embedding space. However, existing approaches still fail to handle the visual complexity of interaction, including (1) intra-class visual diversity, where instances of the same verb appear in diverse poses and contexts, and (2) inter-class visual entanglement, where distinct verbs yield visually similar patterns. To address these challenges, we propose VDRP, a framework for Visual Diversity and Region-aware Prompt learning. First, we introduce a visual diversity-aware prompt learning strategy that injects group-wise visual variance into the context embedding. We further apply Gaussian perturbation to encourage the prompts to capture diverse visual variations of a verb. Second, we retrieve region-specific concepts from the human, object, and union regions. These are used to augment the diversity-aware prompt embeddings, yielding region-aware prompts that enhance verb-level discrimination. Experiments on the HICO-DET benchmark demonstrate that our method achieves state-of-the-art performance under four zero-shot evaluation settings, effectively addressing both intra-class diversity and inter-class visual entanglement. Code is available at https://github.com/mlvlab/VDRP.
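As a rough illustration of the diversity-aware idea, the sketch below estimates a per-dimension spread from a group of visual features and uses it to scale Gaussian noise added to a context embedding. The abstract does not give the actual formulation, so everything here is an assumption: the function names (`group_std`, `perturb_context`), the `noise_scale` parameter, the use of a per-dimension standard deviation, and the additive form `context + noise_scale * std * eps` with `eps` drawn from a standard normal distribution. See the released code for the real method.

```python
import random
import statistics


def group_std(features):
    """Per-dimension population standard deviation over a group of feature vectors."""
    return [statistics.pstdev(dim) for dim in zip(*features)]


def perturb_context(context, std, noise_scale=1.0, rng=None):
    """Add Gaussian noise to a context embedding, scaled by the group spread.

    Hypothetical form: context + noise_scale * std * eps, with eps ~ N(0, 1)
    sampled independently for each dimension.
    """
    rng = rng or random.Random()
    return [c + noise_scale * s * rng.gauss(0.0, 1.0) for c, s in zip(context, std)]


# Toy group: the features vary along dimension 0 only.
group = [[0.0, 1.0], [2.0, 1.0]]
std = group_std(group)

context = [0.5, 0.5]
sample = perturb_context(context, std, rng=random.Random(0))
```

Dimensions along which the group shows no variation are left untouched, while high-variance dimensions receive larger perturbations. This is one simple way to expose a prompt to the visual spread of a verb during training.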
