Visual Prompt Selection for In-Context Learning Segmentation

Wei Suo, Lanqing Lai, Mengyang Sun, Hanwang Zhang, Peng Wang, Yanning Zhang

TL;DR

This work analyzes how visual prompts shape performance in in-context learning for segmentation, and finds that diverse prompts often guide the model to more accurate masks than prompts chosen purely by similarity. It introduces Stepwise Context Search (SCS), which builds a compact yet diverse candidate pool from unlabeled data via clustering and then selects well-matched demonstrations with an adaptive search module guided by IoU rewards. Empirical results on COCO-20^i, PASCAL-5^i, and iSAID-5^i show that SCS consistently improves segmentation performance and can outperform existing prompt-selection strategies, achieving near state-of-the-art results in several settings. The approach also reduces annotation costs and can serve as a plug-in enhancement for existing ICL-based segmentation models such as SegGPT.

Abstract

As a fundamental and extensively studied task in computer vision, image segmentation aims to locate and identify different semantic concepts at the pixel level. Recently, inspired by In-Context Learning (ICL), several generalist segmentation frameworks have been proposed, providing a promising paradigm for segmenting specific objects. However, existing works mostly ignore the value of visual prompts or simply apply similarity sorting to select contextual examples. In this paper, we focus on rethinking and improving the example selection strategy. Through comprehensive comparisons, we first demonstrate that ICL-based segmentation models are sensitive to different contexts. Furthermore, empirical evidence indicates that the diversity of contextual prompts plays a crucial role in guiding segmentation. Based on these insights, we propose a new stepwise context search method. Unlike previous works, we construct a small yet rich candidate pool and adaptively search for well-matched contexts. More importantly, this method effectively reduces the annotation cost by compacting the search space. Extensive experiments show that our method is an effective strategy for selecting examples and enhancing segmentation performance.
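
The idea of building a small yet diverse candidate pool from unlabeled data can be illustrated with a short sketch. This is not the authors' implementation: the text here only says that the pool is built by clustering, so the use of k-means, the farthest-point initialization, the rule of keeping the sample nearest to each centroid, and the names `kmeans` and `build_candidate_pool` are all assumptions made for illustration. The inputs are assumed to be precomputed image feature vectors, and `k` is assumed to be no larger than the number of samples.

```python
from typing import List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(features: List[Vector], k: int, iters: int = 20) -> List[List[float]]:
    """Plain k-means with deterministic farthest-point initialization
    (an assumed choice; any clustering method could be substituted)."""
    centroids = [list(features[0])]
    while len(centroids) < k:
        # Next seed: the sample farthest from all seeds chosen so far.
        far = max(features, key=lambda f: min(sq_dist(f, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for f in features:
            j = min(range(k), key=lambda j: sq_dist(f, centroids[j]))
            clusters[j].append(f)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[j] for j, c in enumerate(clusters)]
    return centroids


def build_candidate_pool(features: List[Vector], k: int) -> List[int]:
    """Return indices of one representative sample per cluster.
    Only these samples would need mask annotation."""
    pool: List[int] = []
    for c in kmeans(features, k):
        idx = min(range(len(features)), key=lambda i: sq_dist(features[i], c))
        if idx not in pool:
            pool.append(idx)
    return pool
```

Under these assumptions, only the `k` returned samples would have to be annotated with masks, which is how a compact pool shrinks both the search space and the labeling effort.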

Paper Structure (21 sections, 4 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Comparison of traditional methods and our method. (a) Existing works rely on dense annotation to build the example space and then select examples for a given query by similarity sorting. (b) In contrast, our method significantly reduces annotation cost by searching for typical examples. Moreover, a novel adaptive search module is designed to further select well-matched contexts.
  • Figure 2: The influence of different context selections. (a) Randomly sampled contextual examples over 5 runs under the 1-shot and 5-shot settings on PASCAL. (b) For each instance, the nearest and farthest examples are retrieved as visual prompts under different similarity-based sorting methods. Surprisingly, the most dissimilar examples achieve better performance on $\sim$40% of the test samples.
  • Figure 3: Diversity vs. similarity. Based on similarity sorting, we show the performance of three choices of visual prompts: the two nearest examples (NN), the two farthest examples (FF), and the nearest example paired with the farthest example (NF).
  • Figure 4: Overview of our SCS method. Instead of similarity sorting on a large annotated dataset, we use clustering to select diverse examples from unlabeled data $D$ and construct the candidate pool. Meanwhile, a search agent based on reinforcement learning further selects contextual demonstrations for each test sample. During inference, the test sample and the candidate pool examples are fed into the model, and visual prompts are searched adaptively.
  • Figure 5: Qualitative results under the 1-shot setting on COCO-$20^i$. The similarity-based selection method is used as our baseline. The green, red, and purple regions are example masks, predicted masks, and ground-truth masks, respectively.
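
The NN, FF, and NF prompt choices from Figure 3, and the IoU score that the TL;DR names as the reward signal, can be sketched as below. This is a minimal illustration under stated assumptions rather than the paper's code: cosine similarity as the sorting metric, flat 0/1 lists as masks, the convention that two empty masks have IoU 1.0, and the function names `cosine`, `prompt_pairs`, and `iou` are all hypothetical choices. `prompt_pairs` assumes at least two candidates.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def prompt_pairs(query: Vector, candidates: List[Vector]) -> Dict[str, Tuple[int, int]]:
    """Rank candidates by similarity to the query (most similar first) and
    return index pairs for the three 2-shot prompt choices of Figure 3."""
    order = sorted(
        range(len(candidates)),
        key=lambda i: cosine(query, candidates[i]),
        reverse=True,
    )
    return {
        "NN": (order[0], order[1]),    # two nearest examples
        "FF": (order[-1], order[-2]),  # two farthest examples
        "NF": (order[0], order[-1]),   # nearest plus farthest
    }


def iou(pred: Sequence[int], gt: Sequence[int]) -> float:
    """Intersection over union of two binary masks given as flat 0/1 lists.
    Two empty masks are treated as a perfect match (assumed convention)."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0
```

Comparing `iou` scores obtained with the NN, FF, and NF pairs on the same query is one way to reproduce the kind of diversity-versus-similarity comparison the figure reports; the same score could serve as a per-sample reward when training a selection agent.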