Just a Few Glances: Open-Set Visual Perception with Image Prompt Paradigm

Jinrong Zhang, Penghui Wang, Chunxiao Liu, Wei Liu, Dian Jin, Qiong Zhang, Erli Meng, Zhengnan Hu

TL;DR

This work tackles open-set perception in detection and segmentation by addressing the limitations of text- and visual-prompt paradigms. It introduces the Image Prompt Paradigm and the MI Grounding framework, which use a small set of automatically curated image prompts and a dedicated image-prompt selection encoder to fuse prompts with multi-scale visual features in a single-stage, non-interactive pipeline. The approach achieves competitive results on standard OSOD/OSS benchmarks and shows notable improvements on cross-domain tasks, with substantial gains on a specialized ADR50K X-ray defect dataset. Overall, it offers a practical path toward fully automated open-set perception suitable for production pipelines and domain-specific applications.

Abstract

To overcome the limitation of pre-training models on fixed categories, Open-Set Object Detection (OSOD) and Open-Set Segmentation (OSS) have attracted a surge of interest from researchers. Inspired by large language models, mainstream OSOD and OSS methods generally use text as a prompt and achieve remarkable performance. Following the SAM paradigm, some researchers use visual prompts, such as points, boxes, and masks that cover the detection or segmentation targets. Although these two prompt paradigms exhibit excellent performance, they also reveal inherent limitations. On the one hand, it is difficult to accurately describe the characteristics of a specialized category with a textual description. On the other hand, existing visual prompt paradigms rely heavily on multi-round human interaction, which hinders their application to fully automated pipelines. To address these issues, we propose a novel prompt paradigm for OSOD and OSS: the Image Prompt Paradigm. This new paradigm makes it possible to detect or segment specialized categories without multi-round human intervention. To this end, the proposed image prompt paradigm uses just a few image instances as prompts, and we propose a novel framework named MI Grounding for it. In this framework, high-quality image prompts are automatically encoded, selected, and fused, enabling single-stage, non-interactive inference. We conduct extensive experiments on public datasets, showing that MI Grounding achieves competitive performance on OSOD and OSS benchmarks compared with text prompt and visual prompt paradigm methods. Moreover, MI Grounding greatly outperforms existing methods on our constructed specialized ADR50K dataset.

Paper Structure

This paper contains 12 sections, 10 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Image prompt paradigm vs. previous prompt paradigms. The text prompt paradigm struggles to accurately describe specialized categories. The visual prompt paradigm relies on multi-round human interaction. The proposed image prompt paradigm uses just a few image instances and can handle specialized categories without any manual annotation.
  • Figure 2: The overall framework of MI Grounding. Image prompts are encoded, selected, and integrated through the image prompts selection encoder (IPS encoder) to obtain category-specific prompt features. These prompt features are then deeply fused and aligned with multi-scale features from the input image to achieve open-set visual perception.
  • Figure 3: Quality of image prompts. Green indicates good image prompts, while red indicates poor ones.
  • Figure 4: The overall framework of PFSM.
  • Figure 5: Examples of specialized categories in the ADR50K dataset. The text denotes the category name of each defect, and the areas within the bounding boxes denote the corresponding category regions.
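
The encode, select, and fuse flow named in the Figure 2 caption can be illustrated with a toy sketch. This is not the paper's IPS encoder: the selection rule (keep the prompts most similar to the centroid), the fusion rule (average the survivors), and the cosine scoring of region features are all assumptions chosen for illustration, and every function name here is hypothetical. Embeddings are plain Python lists standing in for encoder outputs.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def select_and_fuse(prompt_embeddings, keep):
    """Assumed selection rule: keep the `keep` prompts closest to the
    centroid of all prompts, then average them into one prototype."""
    normed = [l2_normalize(p) for p in prompt_embeddings]
    centroid = l2_normalize(mean_vector(normed))
    ranked = sorted(normed, key=lambda p: dot(p, centroid), reverse=True)
    return l2_normalize(mean_vector(ranked[:keep]))


def score_regions(region_features, prototype):
    """Cosine similarity between each region feature and the prototype."""
    return [dot(l2_normalize(r), prototype) for r in region_features]


# Three toy prompt embeddings for one category; the third is an outlier,
# standing in for a poor-quality image prompt.
prompts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
prototype = select_and_fuse(prompts, keep=2)

# Two toy region features from the input image.
regions = [[2.0, 0.1], [0.0, 3.0]]
scores = score_regions(regions, prototype)
```

In this toy setup the outlier prompt is dropped before fusion, so the first region (aligned with the two retained prompts) scores much higher than the second. A real system would use learned encoders and a learned selection module in place of these hand-written rules.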