DiPEx: Dispersing Prompt Expansion for Class-Agnostic Object Detection

Jia Syuen Lim, Zhuoxiao Chen, Mahsa Baktashmotlagh, Zhi Chen, Xin Yu, Zi Huang, Yadan Luo

TL;DR

This paper tackles class-agnostic object detection by identifying semantic overlap in hand-crafted prompts as a key recall bottleneck for vision-language models. It introduces Dispersing Prompt Expansion (DiPEx), a self-supervised method that progressively expands a tree of non-overlapping hyperspherical prompts, guided by uncertainty and dispersion losses, with termination via Maximum Angular Coverage. Empirical results on MS-COCO and LVIS show DiPEx achieving state-of-the-art AR and AP in CA-OD and strong gains in OOD-OD, notably surpassing SAM and prompting baselines in multiple metrics. The approach enables robust, one-pass inference while maintaining a semantic hierarchy among prompts, albeit with increased computational overhead due to iterative self-training and hyperparameter sensitivity.

Abstract

Class-agnostic object detection (OD) can be a cornerstone or a bottleneck for many downstream vision tasks. Despite considerable advancements in bottom-up and multi-object discovery methods that leverage basic visual cues to identify salient objects, consistently achieving a high recall rate remains difficult due to the diversity of object types and their contextual complexity. In this work, we investigate using vision-language models (VLMs) to enhance object detection via a self-supervised prompt learning strategy. Our initial findings indicate that manually crafted text queries often result in undetected objects, primarily because detection confidence diminishes when the query words exhibit semantic overlap. To address this, we propose a Dispersing Prompt Expansion (DiPEx) approach. DiPEx progressively learns to expand a set of distinct, non-overlapping hyperspherical prompts to enhance recall rates, thereby improving performance in downstream tasks such as out-of-distribution OD. Specifically, DiPEx initiates the process by self-training generic parent prompts and selecting the one with the highest semantic uncertainty for further expansion. The resulting child prompts are expected to inherit semantics from their parent prompts while capturing more fine-grained semantics. We apply dispersion losses to ensure high inter-class discrepancy among child prompts while preserving semantic consistency between parent-child prompt pairs. To prevent excessive growth of the prompt sets, we utilize the maximum angular coverage (MAC) of the semantic space as a criterion for early termination. We demonstrate the effectiveness of DiPEx through extensive class-agnostic OD and OOD-OD experiments on MS-COCO and LVIS, surpassing other prompting methods by up to 20.1% in AR and achieving a 21.3% AP improvement over SAM. The code is available at https://github.com/jason-lim26/DiPEx.
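
To make the idea of dispersion concrete, the following is a minimal sketch, not the paper's actual loss. It assumes prompts are plain embedding vectors and uses two hypothetical penalties: the mean pairwise cosine similarity among child prompts (lower means more dispersed) and the mean cosine distance between each child and its parent (lower means more consistent). The function names and the `weight` parameter are illustrative assumptions and are not taken from the DiPEx code base.

```python
import math


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def dispersion_penalty(children):
    """Mean pairwise cosine similarity among child prompts.

    Hypothetical stand-in for a dispersion loss: lower values mean
    the children overlap less in the embedding space.
    """
    pairs = [
        (i, j)
        for i in range(len(children))
        for j in range(i + 1, len(children))
    ]
    return sum(cosine(children[i], children[j]) for i, j in pairs) / len(pairs)


def consistency_penalty(parent, children):
    """Mean (1 - cosine) between the parent and each child.

    Hypothetical stand-in for parent-child semantic consistency:
    lower values mean the children stay closer to the parent.
    """
    return sum(1.0 - cosine(parent, c) for c in children) / len(children)


def total_penalty(parent, children, weight=0.5):
    """Combine both terms; `weight` is an assumed trade-off parameter."""
    return dispersion_penalty(children) + weight * consistency_penalty(parent, children)


# Toy vectors (not real prompt embeddings).
parent = [1.0, 0.0, 0.0]
overlapping = [[1.0, 0.10, 0.0], [1.0, 0.12, 0.0]]
dispersed = [[1.0, 1.0, 0.0], [1.0, -1.0, 0.0]]
```

With these toy vectors, the overlapping children score a higher dispersion penalty than the dispersed pair, while the dispersed pair pays a larger consistency penalty. That is the trade-off the abstract describes between inter-class discrepancy and parent-child consistency.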

Paper Structure (16 sections, 4 equations, 10 figures, 5 tables, 1 algorithm)

Figures (10)

  • Figure 1: (a) An exemplar of the studied class-agnostic OD and downstream OOD-OD tasks. (b) Zero-shot class-agnostic OD performance of Grounding DINO [liu2023grounding] on MS-COCO [DBLP:conf/eccv/LinMBHPRDZ14], with the hand-crafted Universal query from ChatGPT and the Class-wide query from WordNet [fellbaum1998wordnet].
  • Figure 2: A case study investigating the impact of semantic overlap between text queries on the detection confidence of the pre-trained Grounding DINO [liu2023grounding]. Semantic overlap is quantified by the angular distance, denoted as $\Theta$, between tokenized embeddings of word pairs using BERT [DBLP:conf/naacl/DevlinCLT19].
  • Figure 3: An illustration of ① the proposed prompt expansion strategy, which selectively grows a set of child prompts for the highlighted parent prompt across $L$ iterations; ② diversifying the initialized embeddings of the child prompts on a hypersphere; and ③ quantifying the maximum angular coverage $\alpha_{\max}$ for early termination of the prompt growth.
  • Figure 4: Impact of the prompt length on the MS-COCO dataset. Average recall (AR) and average precision (AP) are reported to compare DiPEx against CoOp [zhou2022coop] and CoCoOp [zhou2022cocoop].
  • Figure 5: Heatmap visualization of the angular coverage across all learned prompts in the 2nd, 3rd, and 4th rounds of training. The maximum angular coverage (MAC) increases monotonically from 67.7° in the 2nd round to 75.95° in the final round. The gradual reduction in the rate of change of angular coverage towards the final round suggests that the model is nearing convergence.
  • ...and 5 more figures
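
The angular quantities in the captions of Figures 2, 3, and 5 can be illustrated with a short sketch. It assumes that the angular distance is the arccosine of the cosine similarity between two embeddings, and that the maximum angular coverage is the largest such angle over all prompt pairs. The stopping rule `should_stop` and its `min_gain_deg` threshold are hypothetical, loosely motivated by the slowing rate of change noted for Figure 5; the paper's exact termination criterion may differ.

```python
import math


def angle_deg(u, v):
    """Angular distance in degrees between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_sim = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_sim))


def max_angular_coverage(prompts):
    """Largest pairwise angle among the prompts (assumed definition of MAC)."""
    return max(
        angle_deg(prompts[i], prompts[j])
        for i in range(len(prompts))
        for j in range(i + 1, len(prompts))
    )


def should_stop(mac_history, min_gain_deg=1.0):
    """Hypothetical rule: stop when the latest MAC gain falls below a threshold."""
    if len(mac_history) < 2:
        return False
    return mac_history[-1] - mac_history[-2] < min_gain_deg


# Toy 2-D vectors (not real prompt embeddings).
toy_prompts = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
toy_mac = max_angular_coverage(toy_prompts)
```

In this toy example the widest pair is the two orthogonal axes, so the coverage is the angle between them. A real run would apply the same computation to the learned prompt embeddings after each expansion round.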