Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection

Ting Lei, Shaofeng Yin, Yuxin Peng, Yang Liu

TL;DR

This work tackles zero-shot human-object interaction detection by introducing CMMP, a framework that decouples interactiveness-aware visual feature extraction from generalizable interaction classification using conditional multi-modal prompts. It injects instance-level and global spatial priors into the image encoder and enforces a consistency constraint on language prompts to preserve CLIP knowledge, enabling better transfer to unseen HOIs and verbs. Through a two-stage HOI detection pipeline and extensive ablations on the HICO-DET dataset, CMMP achieves state-of-the-art performance for unseen classes across multiple zero-shot settings and demonstrates robust generalization to novel actions. The approach offers a practical pathway to scalable, generalizable HOI understanding in real-world scenes, with code and models released for reproducibility.

Abstract

Zero-shot Human-Object Interaction (HOI) detection has emerged as a frontier topic due to its capability to detect HOIs beyond a predefined set of categories. This task entails not only identifying the interactiveness of human-object pairs and localizing them but also recognizing both seen and unseen interaction categories. In this paper, we introduce a novel framework for zero-shot HOI detection using Conditional Multi-Modal Prompts, namely CMMP. This approach enhances the generalization of large foundation models, such as CLIP, when fine-tuned for HOI detection. Unlike traditional prompt-learning methods, we propose learning decoupled vision and language prompts for interactiveness-aware visual feature extraction and generalizable interaction classification, respectively. Specifically, we integrate prior knowledge of different granularity into conditional vision prompts, including an input-conditioned instance prior and a global spatial pattern prior. The former encourages the image encoder to treat instances belonging to seen or potentially unseen HOI concepts equally, while the latter provides representative, plausible spatial configurations of the human and object under interaction. In addition, we employ language-aware prompt learning with a consistency constraint that preserves the knowledge of the large foundation model and enables better generalization in the text branch. Extensive experiments demonstrate the efficacy of our detector with conditional multi-modal prompts, which outperforms the previous state of the art on unseen classes in various zero-shot settings. The code and models are available at https://github.com/ltttpku/CMMP.
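
The consistency constraint on language prompts can be illustrated with a small sketch. This section does not give the exact form of the loss, so the code below assumes a cosine-distance penalty between the text feature produced by a learned prompt and the feature produced by a hand-crafted prompt for the same class. The function names and the toy feature vectors are hypothetical and are not taken from the released code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def consistency_loss(learned_feats, handcrafted_feats):
    """Mean cosine distance (1 - cosine similarity) over classes.

    Assumption: this is one plausible regularizer that pulls learned
    prompt features toward hand-crafted prompt features. It is not
    claimed to be the loss used in the paper.
    """
    if len(learned_feats) != len(handcrafted_feats):
        raise ValueError("expected one feature vector per class in both lists")
    distances = [
        1.0 - cosine_similarity(p, c)
        for p, c in zip(learned_feats, handcrafted_feats)
    ]
    return sum(distances) / len(distances)


# Toy features for two classes (made-up numbers, not real CLIP outputs).
learned = [[1.0, 0.0], [0.0, 2.0]]
handcrafted = [[1.0, 0.0], [1.0, 0.0]]
loss = consistency_loss(learned, handcrafted)
```

In this toy case the first class is perfectly aligned and the second is orthogonal, so the penalty comes entirely from the second class. In training, a term like this would be added to the classification loss with a weighting coefficient.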

Paper Structure (22 sections, 10 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: (a) Previous detectors struggle with the delicate balance between seen and unseen classes, resulting in a low harmonic mean (HM) and poor performance on unseen classes. In contrast, our method effectively addresses this balance issue, leading to significant improvement and establishing a new state of the art for unseen classes. (b) Our model uses visual spatial cues during feature extraction to help recognize the interactiveness of unseen HOI concepts and utilizes constrained prompt learning for better generalizability on unseen classes.
  • Figure 2: The overall framework of CMMP. The proposed method splits zero-shot HOI detection into two subtasks: interactiveness-aware visual feature extraction and generalizable interaction classification. We propose decoupled vision and text prompts for each subtask to eliminate the dependence between them and break the error propagation between them. The conditional vision prompts ($P_V$) are used to inject spatial- and interactiveness-aware knowledge into the image encoder and are explicitly constrained by the instance-level visual prior ($C_{ins}$) and the global spatial pattern ($C_{GSP}$). The conditional language prompts ($P_L$) are constrained by the human-designed prompts ($C_L$) through a regularization loss. (Best viewed in color.)
  • Figure 3: Visualization of successfully detected HOIs in the unseen verb setting. Each detected human-object pair is connected by a red line, with the corresponding interaction score overlaid above the human box. All the images contain unseen HOIs made up of unseen verbs and seen objects.
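
The harmonic mean (HM) mentioned in the Figure 1 caption can be made concrete with a short sketch. It assumes HM is the standard harmonic mean of the mAP on seen classes and the mAP on unseen classes. The function name and the example numbers are hypothetical and are not results from the paper.

```python
def harmonic_mean(seen_map, unseen_map):
    """Harmonic mean of seen-class and unseen-class mAP.

    Returns 0.0 when both inputs are zero to avoid division by zero.
    """
    total = seen_map + unseen_map
    if total == 0:
        return 0.0
    return 2.0 * seen_map * unseen_map / total


# Made-up scores: both detectors share the same arithmetic mean (20.0),
# but the imbalanced one is penalized by the harmonic mean.
imbalanced = harmonic_mean(30.0, 10.0)
balanced = harmonic_mean(20.0, 20.0)
```

Because the harmonic mean is dominated by the smaller of the two values, a detector that trades away unseen-class accuracy for seen-class accuracy scores lower than a balanced one with the same average.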