Open-World Human-Object Interaction Detection via Multi-modal Prompts

Jie Yang, Bingliang Li, Ailing Zeng, Lei Zhang, Ruimao Zhang

TL;DR

This work tackles open-world human-object interaction detection by introducing MP-HOI, a multi-modal prompt-based detector that leverages textual prompts for open-set generalization and visual prompts as exemplars to mitigate description ambiguity. It couples a Representative Feature Encoder with diffusion- and CLIP-derived features, and uses a cross-modal contrastive loss to align prompts with objects and interactions. The authors build Magic-HOI, a unified large-scale HOI dataset, and SynHOI, a high-quality synthetic counterpart to address long-tail issues, enabling scalable training. Empirical results show state-of-the-art performance across benchmarks and strong zero-shot capabilities, along with demonstrated open-world robustness to both textual and visual prompts. The approach promises practical impact for flexible, real-world HOI understanding in diverse visual environments.

Abstract

In this paper, we develop MP-HOI, a powerful Multi-modal Prompt-based HOI detector that leverages both textual descriptions for open-set generalization and visual exemplars for handling highly ambiguous descriptions, realizing HOI detection in the open world. Specifically, it integrates visual prompts into existing language-guided-only HOI detectors to handle cases where textual descriptions generalize poorly and to address complex scenarios with high interaction ambiguity. To facilitate MP-HOI training, we build a large-scale HOI dataset named Magic-HOI, which gathers six existing datasets into a unified label space, forming over 186K images with 2.4K objects, 1.2K actions, and 20K HOI interactions. Furthermore, to tackle the long-tail issue within the Magic-HOI dataset, we introduce an automated pipeline for generating realistically annotated HOI images and present SynHOI, a high-quality synthetic HOI dataset containing 100K images. Leveraging these two datasets, MP-HOI optimizes the HOI task as a similarity learning process between multi-modal prompts and objects/interactions via a unified contrastive loss, learning generalizable and transferable object and interaction representations from large-scale data. MP-HOI can serve as a generalist HOI detector, surpassing the HOI vocabulary of existing expert models by more than 30 times. Our results also demonstrate that MP-HOI exhibits remarkable zero-shot capability in real-world scenarios and consistently achieves new state-of-the-art performance across various benchmarks.
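The idea of treating HOI detection as similarity learning between prompts and objects/interactions can be illustrated with a small sketch. This excerpt does not give the paper's actual loss, so the code below is only a generic softmax (InfoNCE-style) contrastive objective under stated assumptions: prompt embeddings (from text or from visual exemplars) and the detector's interaction embedding live in one shared space, similarity is cosine similarity, and the temperature of 0.07 is a hypothetical value. All function names are illustrative, not the authors' API.

```python
import math

def l2_normalize(v):
    # Assumes a non-zero vector; scales it to unit length.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_logits(query, prompts, temperature=0.07):
    # Cosine similarity between one detected object/interaction embedding
    # and every prompt embedding, scaled by a (hypothetical) temperature.
    q = l2_normalize(query)
    return [
        sum(a * b for a, b in zip(q, l2_normalize(p))) / temperature
        for p in prompts
    ]

def contrastive_loss(query, prompts, target_index, temperature=0.07):
    # Softmax cross-entropy over similarities: the matching prompt is the
    # positive, all other prompts act as negatives.
    logits = cosine_logits(query, prompts, temperature)
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

def predict(query, prompts, temperature=0.07):
    # Open-vocabulary prediction: pick the most similar prompt.
    logits = cosine_logits(query, prompts, temperature)
    return max(range(len(logits)), key=logits.__getitem__)

if __name__ == "__main__":
    # Toy embeddings standing in for three prompts (textual or visual).
    prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    query = [0.9, 0.1, 0.0]
    print(predict(query, prompts))
    print(contrastive_loss(query, prompts, target_index=0))
```

Because classification reduces to a nearest-prompt lookup, extending the vocabulary in this sketch only requires supplying new prompt embeddings at inference time, which is the property that makes a prompt-based formulation attractive for open-world use.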

Paper Structure (18 sections, 3 equations, 6 figures, 11 tables)

Figures (6)

  • Figure 1: We show (a) the composite interactions that coexist for a single person in the wild (e.g., a man squatting on the ground, holding a paintbrush, and painting the wall); (b) the long-tail distribution issue in our Magic-HOI dataset, along with the proposed SynHOI dataset that addresses it.
  • Figure 2: Illustration of (a) HOIPrompts and (b) how HOIPrompts guide the text-to-image generation process to enhance diversity. For more visualizations, please refer to the Appendix.
  • Figure 3: Overview of MP-HOI, comprising three components: Representative Feature Encoder, Sequential Instance and Interaction Decoders, and Multi-modal Prompt-based Predictor. Ultimately, it can leverage textual or visual prompts to detect open-world HOIs.
  • Figure 4: Attribution analysis of Stable Diffusion between object/interaction texts and real/synthetic images. The visualization uses only time step $0$.
  • Figure 5: In-the-wild test based on arbitrary textual prompts. Each HOI triplet is represented in the same color.
  • ...and 1 more figure