Towards Training-free Open-world Segmentation via Image Prompt Foundation Models

Lv Tang, Peng-Tao Jiang, Hao-Ke Xiao, Bo Li

TL;DR

This work taps foundation models for open-world understanding through visual concepts conveyed in images, using a novel feature interaction module to generate point prompts that highlight target objects in the input image.

Abstract

Computer vision has undergone a paradigm shift with the advent of foundation models, mirroring the transformative influence of large language models in natural language processing. This paper explores open-world segmentation and presents Image Prompt Segmentation (IPSeg), a novel approach that harnesses vision foundation models. IPSeg follows a training-free paradigm built on image prompt techniques. Specifically, IPSeg uses a single image containing a subjective visual concept as a flexible prompt to query vision foundation models such as DINOv2 and Stable Diffusion. The approach extracts robust features for the prompt image and the input image, then matches the input representations to the prompt representations via a novel feature interaction module to generate point prompts highlighting target objects in the input image. These point prompts then guide the Segment Anything Model (SAM) to segment the target object in the input image. Because it requires no additional training, the method offers a more efficient and scalable solution. Experiments on COCO, PASCAL VOC, and other datasets demonstrate IPSeg's efficacy for flexible open-world segmentation using intuitive image prompts. This work pioneers tapping foundation models for open-world understanding through visual concepts conveyed in images.
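The matching step described in the abstract can be illustrated with a minimal sketch. This is not the paper's feature interaction module: it assumes patch features have already been extracted for both images, assumes a boolean foreground mask is available for the prompt image, and swaps in plain cosine similarity against a mean foreground prototype. The function name `point_prompts_from_similarity` and the parameter `k` are hypothetical.

```python
import numpy as np

def point_prompts_from_similarity(input_feats, prompt_feats, prompt_mask, k=3):
    """Turn prompt-to-input feature similarity into point prompts.

    input_feats:  (H, W, C) patch features of the input image (assumed given).
    prompt_feats: (h, w, C) patch features of the prompt image (assumed given).
    prompt_mask:  (h, w) boolean mask, True on the prompt's foreground object
                  (an assumption of this sketch).
    Returns (points, labels, sim): points is (2k, 2) in (x, y) patch
    coordinates, labels is 1 for positive and 0 for negative points,
    sim is the (H, W) similarity map.
    """
    H, W, C = input_feats.shape

    # Prototype: mean feature over the prompt's foreground patches, L2-normalized.
    proto = prompt_feats[prompt_mask].mean(axis=0)
    proto = proto / (np.linalg.norm(proto) + 1e-8)

    # Cosine similarity between every input patch and the prototype.
    flat = input_feats.reshape(-1, C)
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    sim = flat @ proto

    # Most similar patches become positive points, least similar negative ones.
    order = np.argsort(sim)
    pos_idx = order[-k:][::-1]
    neg_idx = order[:k]
    idx = np.concatenate([pos_idx, neg_idx])

    ys, xs = np.unravel_index(idx, (H, W))
    points = np.stack([xs, ys], axis=1)
    labels = np.concatenate([np.ones(k, dtype=int), np.zeros(k, dtype=int)])
    return points, labels, sim.reshape(H, W)
```

The returned coordinates are on the patch grid, so they would need to be scaled to pixel coordinates before being passed to a point-prompted segmenter. The 1/0 labels follow the common foreground/background convention for point prompts.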

Paper Structure (28 sections, 6 equations, 10 figures, 8 tables)

This paper contains 28 sections, 6 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Comparison of different open-world segmentation frameworks based on foundation models. From left to right: foundation model adaptations, task-specific foundation models trained from scratch, and training-free foundation models.
  • Figure 2: Different prompt forms in existing open-world segmentation methods. Left: prompts given as predefined textual descriptions or categories. Middle: the prompt form used in existing one-shot object segmentation works [liu2023matcher, zhang2023personalize]. Right: the prompt form used in this paper, which uses only one image containing a salient object with specific visual concepts.
  • Figure 3: The proposed IPSeg framework. Importantly, all parameters in the network remain frozen, eliminating the need for additional training. The green points in $\mathcal{P}_\mathcal{G}$ represent the positive point prompts sent to SAM, while the red points represent the negative point prompts sent to SAM.
  • Figure 4: Visualization results of features extracted from different models. The second and fifth columns indicate the use of only the DINOv2 model for feature extraction, while the third and sixth columns denote the use of both DINOv2 and SD models for this purpose.
  • Figure 5: Visualization of the features of foreground objects in the prompt image and of all objects in the input image.
  • ...and 5 more figures