
Prompt-Free Universal Region Proposal Network

Qihong Tang, Changhan Liu, Shaofeng Zhang, Wenbin Li, Qi Fan, Yang Gao

Abstract

Identifying potential objects is critical for object recognition and analysis across various computer vision applications. Existing methods typically localize potential objects by relying on exemplar images, predefined categories, or textual descriptions. However, their reliance on image and text prompts often limits flexibility, restricting adaptability in real-world scenarios. In this paper, we introduce a novel Prompt-Free Universal Region Proposal Network (PF-RPN), which identifies potential objects without relying on external prompts. First, the Sparse Image-Aware Adapter (SIA) module performs initial localization of potential objects using a learnable query embedding dynamically updated with visual features. Next, the Cascade Self-Prompt (CSP) module identifies the remaining potential objects by leveraging the self-prompted learnable embedding, autonomously aggregating informative visual features in a cascading manner. Finally, the Centerness-Guided Query Selection (CG-QS) module facilitates the selection of high-quality query embeddings using a centerness scoring network. Our method can be optimized with limited data (e.g., 5% of MS COCO data) and applied directly, without fine-tuning, to identify potential objects in diverse application domains such as underwater object detection, industrial defect detection, and remote sensing image object detection. Experimental results across 19 datasets validate the effectiveness of our method. Code is available at https://github.com/tangqh03/PF-RPN.

Paper Structure (20 sections, 5 equations, 9 figures, 13 tables)

Figures (9)

  • Figure 1: Existing visual- or text-prompt-based OVD methods typically rely on predefined categories or exemplar images to propose potential objects for the target image. Recent prompt-free OVD methods often use VLMs to generate textual descriptions of the target image in order to localize potential objects, which introduces considerable latency. In contrast, our PF-RPN does not require any external prompts and uses only visual features to generate high-quality proposals. Experimental results show the effectiveness of our PF-RPN in localizing potential objects.
  • Figure 2: Overall architecture of our PF-RPN. It comprises three core components: (1) the Sparse Image-Aware Adapter (SIA) module, which adaptively integrates multi-level feature maps $F^I_i$ with a learnable embedding $F^T$ via a routing mechanism and cross-attention; (2) the Cascade Self-Prompt (CSP) module, which iteratively refines the embedding through masked average pooling across multiple visual levels; and (3) the Centerness-Guided Query Selection (CG-QS) module, which decodes the features into final predictions optimized by contrastive, regression, and centerness losses.
  • Figure 3: Effect of iterations in the Cascade Self-Prompt module. Visualization of region selection across different Cascade Self-Prompt iterations. Green points indicate the object regions selected by the model in the current iteration. As the number of iterations increases, the model progressively selects more object regions in the image, demonstrating the effectiveness of our cascade self-prompt mechanism.
  • Figure 4: Effect of the Sparse Image-Aware Adapter. Visualization of similarity heatmaps between the learnable embedding and image features before and after the module update. Each pair of heatmaps (top: before, bottom: after) corresponds to the same image. After the update, the learnable embedding exhibits stronger responses in semantically relevant regions, indicating improved alignment between visual and learned representations and providing a stronger prior for the cascade self-prompt module.
  • Figure 5: Effect of the Centerness-Guided Query Selection. Visualization of query selection before and after applying the Centerness-Guided Query Selection (CG-QS) module. Each pair of heatmaps (top: before, bottom: after) corresponds to the same image. After applying the CG-QS module, the model tends to select queries near object centers, thereby generating more accurate proposals.
  • ...and 4 more figures
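
The Figure 2 caption says the Cascade Self-Prompt module refines the learnable embedding through masked average pooling across multiple visual levels. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the mask at each level is obtained by thresholding cosine similarity between the embedding and each location's feature vector, and that the pooled vector simply replaces the embedding. The function names, the `threshold` parameter, and the update rule are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def masked_average_pool(features, mask):
    """Average the feature vectors whose mask entry is True.

    features: list of N vectors (one per spatial location), each of length C.
    mask: list of N booleans. Returns None when no location is selected.
    """
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        return None
    return [sum(column) / len(selected) for column in zip(*selected)]


def cascade_self_prompt(levels, embedding, threshold=0.5):
    """Refine `embedding` once per feature level.

    Assumed (hypothetical) update rule: select locations whose similarity to
    the current embedding exceeds `threshold`, then replace the embedding
    with the masked average of the selected features.
    """
    masks = []
    for features in levels:
        mask = [cosine(f, embedding) > threshold for f in features]
        pooled = masked_average_pool(features, mask)
        if pooled is not None:  # keep the old embedding if nothing matched
            embedding = pooled
        masks.append(mask)
    return embedding, masks


# Two toy feature levels, three locations each, two channels per location.
levels = [
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    [[0.8, 0.6], [1.0, 0.0], [0.0, 1.0]],
]
refined, masks = cascade_self_prompt(levels, embedding=[1.0, 0.0])
```

In this toy input, the first two locations of each level are selected and the third is rejected, so the embedding drifts toward the features it matched. The actual module may use learned projections, soft masks, or a different aggregation across levels.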