Orchestrating the Symphony of Prompt Distribution Learning for Human-Object Interaction Detection
Mingda Jia, Liming Zhao, Ge Li, Yun Zheng
TL;DR
InterProDa tackles the challenge of recognizing uncommon HOI patterns and disambiguating similar HOIs by representing each HOI category as a distribution over multiple soft prompts. It learns three dedicated prompt spaces for subjects, objects, and interactions, builds Gaussian distribution spaces from them, and uses a dynamic orthogonal constraint to capture cross-category dependencies while maintaining intra-category diversity. Sampling from these distribution spaces generates distribution-guided queries that are fused with standard HOI decoders, enabling end-to-end training with a joint HOI loss and a distribution-regularization term. Empirically, InterProDa achieves competitive or state-of-the-art results on HICO-DET and V-COCO and can be plugged into existing transformer-based HOI detectors with minimal parameter overhead, improving robustness to rare and unseen patterns.
Abstract
Human-object interaction (HOI) detectors built on the popular query-transformer architecture have achieved promising performance. However, accurately identifying uncommon visual patterns and distinguishing between ambiguous HOIs remain difficult for them. We observe that these difficulties may arise from the limited capacity of traditional detector queries to represent diverse intra-category patterns and inter-category dependencies. To address this, we introduce the Interaction Prompt Distribution Learning (InterProDa) approach. InterProDa learns multiple sets of soft prompts and estimates category distributions from these prompts. It then incorporates the category distributions into HOI queries, making them capable of representing near-infinite intra-category dynamics and universal cross-category relationships. Our InterProDa detector demonstrates competitive performance on the HICO-DET and V-COCO benchmarks. Additionally, our method can be integrated into most transformer-based HOI detectors, significantly enhancing their performance with minimal additional parameters.
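
The core idea of estimating a category distribution from several prompts and sampling queries from it can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the diagonal-covariance Gaussian, the array shapes, and the simple cosine-similarity penalty (a plain stand-in for the dynamic orthogonal constraint, whose exact form is not given here) are all assumptions made for illustration.

```python
import numpy as np


def estimate_category_gaussians(prompts):
    """Fit one diagonal Gaussian per category (assumed form, for illustration).

    prompts: array of shape (C, K, D), i.e. K prompt embeddings of size D
             for each of C categories.
    Returns the per-category mean and standard deviation, each of shape (C, D).
    """
    mean = prompts.mean(axis=1)
    std = prompts.std(axis=1) + 1e-6  # small floor keeps sampling well defined
    return mean, std


def sample_distribution_queries(mean, std, num_samples, rng):
    """Draw samples as mean + std * eps with eps ~ N(0, I).

    Returns an array of shape (C, num_samples, D).
    """
    num_categories, dim = mean.shape
    eps = rng.standard_normal((num_categories, num_samples, dim))
    return mean[:, None, :] + std[:, None, :] * eps


def orthogonality_penalty(prompts):
    """Mean squared off-diagonal cosine similarity among each category's prompts.

    Hypothetical, non-dynamic stand-in for an orthogonal constraint: it is
    zero when the K prompts of every category are mutually orthogonal.
    """
    unit = prompts / np.linalg.norm(prompts, axis=-1, keepdims=True)
    gram = unit @ unit.transpose(0, 2, 1)          # (C, K, K) cosine similarities
    off_diag = gram - np.eye(prompts.shape[1])     # remove self-similarity
    return float((off_diag ** 2).mean())


rng = np.random.default_rng(0)
prompts = rng.standard_normal((5, 8, 16))          # 5 categories, 8 prompts, dim 16
mean, std = estimate_category_gaussians(prompts)
queries = sample_distribution_queries(mean, std, num_samples=4, rng=rng)
penalty = orthogonality_penalty(prompts)
```

In a trainable version, the prompts would be learnable parameters, the sampled queries would be combined with the decoder's HOI queries, and a penalty of this kind would be added to the detection loss; how InterProDa does each of these is not specified in the summary above.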
