Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection

Chanhyeong Yang, Taehoon Song, Jihwan Park, Hyunwoo J. Kim

TL;DR

This work introduces VDRP, a prompt-learning framework for zero-shot HOI detection that tackles intra-class visual diversity and inter-class entanglement by (1) injecting group-wise visual variance and Gaussian perturbations into verb prompts (visual diversity-aware prompts, VDP) and (2) augmenting prompts with region-specific concepts from human, object, and union regions (region-aware prompts, RAP). The approach uses a two-stage HOI pipeline with a frozen detector and a CLIP-based backbone: it extracts region features and computes verb logits with region-conditioned prompts, then averages the per-region outputs to yield HOI predictions. Experiments on HICO-DET across four zero-shot settings show state-of-the-art performance; ablations confirm that VDP and RAP are complementary, and qualitative results illustrate interpretable region-wise concept retrieval. The method also generalizes well, is parameter-efficient, and scales to stronger backbones, which supports the value of distributional and region-grounded prompt learning for robust zero-shot HOI understanding.
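The averaging of region-conditioned verb logits described above can be illustrated with a minimal sketch. The summary only states that per-region outputs are averaged; the use of plain cosine similarity as the logit, the absence of temperature scaling, and the function names `verb_logits` and `averaged_verb_logits` are assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verb_logits(region_feature, region_prompts):
    """Score one region feature against each verb's prompt embedding.

    Assumption: the logit is an unscaled cosine similarity.
    """
    return [cosine(region_feature, prompt) for prompt in region_prompts]


def averaged_verb_logits(features, prompts):
    """Average per-verb logits over the human, object, and union regions.

    features: dict mapping region name -> feature vector
    prompts:  dict mapping region name -> list of per-verb prompt embeddings
    """
    regions = ("human", "object", "union")
    per_region = [verb_logits(features[r], prompts[r]) for r in regions]
    num_verbs = len(per_region[0])
    return [
        sum(logits[v] for logits in per_region) / len(per_region)
        for v in range(num_verbs)
    ]
```

With two verbs and toy 2-D features, a verb that matches the human and union regions but not the object region receives an averaged score of 2/3, so no single region decides the prediction on its own.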

Abstract

Zero-shot Human-Object Interaction detection aims to localize humans and objects in an image and recognize their interaction, even when specific verb-object pairs are unseen during training. Recent works have shown promising results using prompt learning with pretrained vision-language models such as CLIP, which align natural language prompts with visual features in a shared embedding space. However, existing approaches still fail to handle the visual complexity of interaction, including (1) intra-class visual diversity, where instances of the same verb appear in diverse poses and contexts, and (2) inter-class visual entanglement, where distinct verbs yield visually similar patterns. To address these challenges, we propose VDRP, a framework for Visual Diversity and Region-aware Prompt learning. First, we introduce a visual diversity-aware prompt learning strategy that injects group-wise visual variance into the context embedding. We further apply Gaussian perturbation to encourage the prompts to capture diverse visual variations of a verb. Second, we retrieve region-specific concepts from the human, object, and union regions. These are used to augment the diversity-aware prompt embeddings, yielding region-aware prompts that enhance verb-level discrimination. Experiments on the HICO-DET benchmark demonstrate that our method achieves state-of-the-art performance under four zero-shot evaluation settings, effectively addressing both intra-class diversity and inter-class visual entanglement. Code is available at https://github.com/mlvlab/VDRP.
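The two prompt-level operations in the abstract, Gaussian perturbation and augmentation with retrieved region concepts, can be sketched as follows. This is a minimal illustration under stated assumptions: noise is drawn per dimension with standard deviation equal to the square root of the variance, retrieval scores are dot products between a region feature and concept embeddings, and the weights come from Sparsemax (which the Figure 3 caption names). The helper names `perturb_prompt` and `region_aware_prompt` are hypothetical.

```python
import math
import random


def perturb_prompt(t, variance, rng):
    """Add zero-mean Gaussian noise with per-dimension std sqrt(variance)."""
    return [ti + rng.gauss(0.0, 1.0) * math.sqrt(vi) for ti, vi in zip(t, variance)]


def sparsemax(scores):
    """Project scores onto the probability simplex; can yield exact zeros."""
    ordered = sorted(scores, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for j, z_j in enumerate(ordered, start=1):
        cumulative += z_j
        # The support condition holds for a prefix of the sorted scores,
        # so the last j that satisfies it fixes the threshold tau.
        if 1.0 + j * z_j > cumulative:
            tau = (cumulative - 1.0) / j
    return [max(s - tau, 0.0) for s in scores]


def region_aware_prompt(t_tilde, region_feature, concept_pool):
    """Add a Sparsemax-weighted mix of concept embeddings to the prompt.

    Assumption: retrieval scores are plain dot products.
    """
    scores = [
        sum(x * c for x, c in zip(region_feature, concept))
        for concept in concept_pool
    ]
    weights = sparsemax(scores)
    retrieved = [
        sum(w * concept[d] for w, concept in zip(weights, concept_pool))
        for d in range(len(t_tilde))
    ]
    return [a + b for a, b in zip(t_tilde, retrieved)], weights
```

Unlike softmax, Sparsemax can assign exactly zero weight to irrelevant concepts, so only a few concepts per region contribute to the final prompt.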

Paper Structure

This paper contains 23 sections, 21 equations, 8 figures, 17 tables.

Figures (8)

  • Figure 1: Analysis of the visual complexity in HOI detection. (A) Verb classes exhibit significant intra-class visual diversity, where instances of the same verb (e.g., "holding a baseball glove") appear under varied poses, viewpoints, and scene contexts. To quantify this, we crop the union region and extract the CLIP visual CLS feature. A diversity score is then computed as the expected cosine dissimilarity $\mathbb{E}[1 - \cos(\cdot)]$ across samples of the same class. Verb classes exhibit higher diversity (0.364 $\pm$ 0.060) than object classes (0.274 $\pm$ 0.048), highlighting the difficulty of representing verbs with a single static embedding. (B) Verb classification also suffers from inter-class visual entanglement, where semantically distinct verbs (e.g., "eating", "licking", "sitting at") yield visually similar patterns. To visualize this, we randomly select five verb classes, extract their union-region CLS features, and project them to 2D using t-SNE. The resulting clusters show significant overlap, highlighting the need for region-aware prompts to improve verb separability in HOI detection.
  • Figure 2: Overview of our VDRP framework. (A) We adopt a two-stage HOI detection pipeline with a frozen detector and a CLIP image encoder to extract human ($\mathbf{x}_\text{h}$), object ($\mathbf{x}_\text{o}$), and union ($\mathbf{x}_{\tilde{\text{u}}}$) features. A spatial head further refines the union feature into $\mathbf{x}_{\text{u}}$ for region-aware prompts via spatial encoding. (B) Visual diversity-aware prompts are generated by injecting group-wise variance and perturbation to model intra-class variation. (C) Retrieved region concepts are then fused with these prompts to produce final region-aware prompts $\mathbf{T}_\text{h}$, $\mathbf{T}_\text{o}$, and $\mathbf{T}_\text{u}$ used for verb classification.
  • Figure 3: Detailed architecture of our methods. (A) To model intra-class variation, we compute verb-wise visual variance $\boldsymbol{\sigma}_v^2$ from union-region CLS features, average them over similar verbs to obtain group-wise variance $\bar{\boldsymbol{\sigma}}_v^2$, and inject it into the shared context embedding $\mathbf{E}$ via an MLP. This is combined with the verb prompt $\bar{\mathbf{P}}_v$ and encoded by the CLIP text encoder to produce $\mathbf{t}^v$, which is further perturbed using Gaussian noise scaled by visual variance. (B) For inter-class discriminability, we retrieve region concepts from features $\mathbf{x}_{(\cdot)}$ using a Sparsemax over a concept pool $\mathcal{C}_{(\cdot)}^v$, and add the result to $\tilde{\mathbf{t}}^v$ to obtain the final region-aware prompt $\hat{\mathbf{t}}^v_{(\cdot)}$.
  • Figure 4: Qualitative examples of region concept generation and retrieval. (A) Given a verb prompt and region type, an LLM generates $K$ region concepts per verb. (B) Retrieved concepts for “Licking” and “Eating” highlight subtle region concepts that help disambiguate visually similar interactions. Concepts are color-coded by region: blue (human), red (object), yellow (union).
  • Figure 5: Pairwise inter-class distances between prompts and visual features. We report the average pairwise cosine distance (i.e., $D = \mathbb{E}_{i \neq j}[1 - \cos(z_i, z_j)]$) across verb classes, for both visual and prompt embeddings, before and after training. Visual features are extracted from union regions. Before training, we use the CLS token from the CLIP visual encoder applied to cropped union region images. After training, we follow the RoI-Align feature extraction pipeline consistent with the two-stage method (i.e., pooling patch embeddings within the union box). For each verb class, a medoid is selected among all union features to represent its prototype. While prompt embeddings are initially collapsed with low diversity, VDRP maintains a balanced and aligned distribution relative to visual features, unlike CMMP which over-separates prompts and disrupts cross-modal structure.
  • ...and 3 more figures
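
The diversity score in Figure 1 and the inter-class distance in Figure 5 are both expectations of cosine dissimilarity, $\mathbb{E}[1 - \cos(\cdot)]$, and Figure 5 represents each verb class by a medoid. A minimal sketch of both computations is shown below. It assumes the expectation is taken over all distinct pairs of feature vectors and that the medoid is the sample with the smallest total cosine distance to the others; the function names are hypothetical and the toy vectors stand in for CLIP features.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pairwise_cosine_distance(vectors):
    """Average of 1 - cos over all distinct pairs.

    Averaging over unordered pairs equals the expectation over ordered
    pairs with i != j because cosine distance is symmetric.
    """
    pairs = list(combinations(vectors, 2))
    return sum(1.0 - cosine(a, b) for a, b in pairs) / len(pairs)


def medoid(vectors):
    """Return the sample with the smallest total cosine distance to the rest."""
    def total_distance(i):
        return sum(1.0 - cosine(vectors[i], v) for v in vectors)

    best = min(range(len(vectors)), key=total_distance)
    return vectors[best]
```

Applied to the features of one class, `mean_pairwise_cosine_distance` gives an intra-class diversity score; applied to one medoid per verb class, it gives an inter-class distance of the form $D$.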