
Orchestrating the Symphony of Prompt Distribution Learning for Human-Object Interaction Detection

Mingda Jia, Liming Zhao, Ge Li, Yun Zheng

TL;DR

InterProDa tackles the challenge of recognizing uncommon HOI patterns and disambiguating similar HOIs by representing each HOI category as a distribution over multiple soft prompts. It learns three dedicated prompt spaces for subjects, objects, and interactions, builds Gaussian distribution spaces, and uses a dynamic orthogonal constraint to capture cross-category dependencies while maintaining intra-category diversity. Sampling from these distribution spaces generates distribution-guided queries that are fused with standard HOI decoders, enabling end-to-end training with a joint HOI loss and a distribution-regularization term. Empirically, InterProDa achieves competitive or state-of-the-art results on HICO-DET and V-COCO and can be plugged into existing transformer-based HOI detectors with minimal parameter overhead, improving robustness to rare and unseen patterns.

Abstract

Human-object interaction (HOI) detectors built on the popular query-transformer architecture have achieved promising performance. However, accurately identifying uncommon visual patterns and distinguishing between ambiguous HOIs remain difficult for them. We observe that these difficulties may arise from the limited capacity of traditional detector queries to represent diverse intra-category patterns and inter-category dependencies. To address this, we introduce the Interaction Prompt Distribution Learning (InterProDa) approach. InterProDa learns multiple sets of soft prompts and estimates category distributions from these prompts. It then combines HOI queries with the category distributions, making them capable of representing near-infinite intra-category dynamics and universal cross-category relationships. Our InterProDa detector demonstrates competitive performance on the HICO-DET and V-COCO benchmarks. Additionally, our method can be integrated into most transformer-based HOI detectors, significantly enhancing their performance with minimal additional parameters.
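As a rough illustration of "estimating a category distribution from multiple prompts and sampling from it", the sketch below fits a diagonal Gaussian to a handful of prompt embeddings and draws one sample as mean plus standard deviation times noise. It is not the authors' implementation: the function names, the diagonal-covariance choice, and the toy data are all assumptions made for this example, and a real model would operate on learned prompt embeddings rather than random lists.

```python
import math
import random


def estimate_diag_gaussian(prompts):
    """Per-dimension mean and (population) variance over K prompt embeddings.

    `prompts` is a list of K embeddings, each a list of D floats.
    A diagonal covariance is an assumption of this sketch.
    """
    k = len(prompts)
    d = len(prompts[0])
    mean = [sum(p[j] for p in prompts) / k for j in range(d)]
    var = [sum((p[j] - mean[j]) ** 2 for p in prompts) / k for j in range(d)]
    return mean, var


def sample_query(mean, var, rng):
    """Draw one embedding as mean + std * eps, with eps ~ N(0, 1) per dimension."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mean, var)]


rng = random.Random(0)
# Hypothetical toy data: 4 soft-prompt embeddings of dimension 3 for one category.
prompts = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(4)]
mean, var = estimate_diag_gaussian(prompts)
query = sample_query(mean, var, rng)  # one "distribution-guided" sample
```

Writing the sample as mean plus scaled noise keeps the draw a simple function of the distribution parameters, which is the usual way such sampling is made compatible with gradient-based training; whether InterProDa uses exactly this form is not stated in the text above.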

Paper Structure

This paper contains 20 sections, 5 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Traditional HOI detectors memorize limited and common visual content, which makes them fragile in recognizing uncommon visual patterns. InterProDa models each HOI category as a distribution to represent unlimited intra-category patterns. We also learn cross-category dependencies with constraints between every distribution. Zoom in for details.
  • Figure 2: Performance comparison with respect to different settings of decoder queries on HICO-DET. We expand the query of our model with an additional intra-category pattern dimension. Each model is labeled with a format like $32\times2$, which indicates a model with a category (query) dimension of 32 and a pattern dimension of 2. Models with the same color share identical parameter counts and overall query dimensions (for example, $32\times2$ equals $64\times1$). Comparisons between models of the same color demonstrate that a higher pattern dimension improves performance. More experimental details are given in Section 4.7.
  • Figure 3: The pipeline of InterProDa. We learn multiple groups of soft prompts for subject, object, and interaction categories. Then, we estimate the distributions of these category prompt embeddings and constrain them in a continuous feature space. This approach learns diverse intra-category patterns in each category distribution and captures the universal inter-category dependencies. We then sample from the learned distribution space to obtain a category distribution query that enhances HOI prediction.
  • Figure 4: Visualization of intra-category variances of selected prompt distributions learned from HICO-DET. Each row corresponds to a different category, while each token corresponds to the variance of a single prompt. We select the four distributions with the highest average variance and the four with the lowest, and list them in sorted order from top to bottom. We also show HICO-DET images that correspond to two of these category distributions. The distributions with high variance consistently indicate HOI categories with more diverse visual patterns.
  • Figure 5: t-SNE visualization of learned prompt distributions of 20 random interaction categories on HICO-DET. Each cluster of a single color contains the category prompt embeddings belonging to a distinct HOI category distribution. The learned distribution space has clear margins between categories, showing suitable cross-category dependencies. Best viewed in color.
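The captions for Figures 3 and 5 describe constraining category distributions so that they stay separated in feature space. The exact form of the paper's dynamic orthogonal constraint is not given on this page, so the sketch below shows only a generic, static stand-in: the mean squared cosine similarity between every pair of category mean vectors, which is zero when all means are mutually orthogonal. The function names and the choice of penalty are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def orthogonality_penalty(means):
    """Mean squared cosine similarity over all distinct pairs of category means.

    Requires at least two non-zero vectors. This is a generic static penalty
    assumed for this sketch, not the paper's dynamic constraint.
    """
    n = len(means)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(means[i], means[j]) ** 2 for i, j in pairs) / len(pairs)


# Hypothetical category means: the first two are orthogonal, the third overlaps both.
means = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
penalty = orthogonality_penalty(means)
```

Adding such a term to a training loss would push category means apart; squaring the cosine penalizes both aligned and anti-aligned pairs equally.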