SAMIC: Segment Anything with In-Context Spatial Prompt Engineering

Savinay Nagendra, Kashif Rashid, Chaopeng Shen, Daniel Kifer

TL;DR

SamIC introduces a compact 2.6M-parameter network that learns to generate task-specific spatial prompts for the Segment Anything Model (SAM), enabling cross-domain few-shot segmentation with minimal labeled data. It pairs a lightweight in-context saliency predictor (HSNet-based) with an annotation tool (SamBox) for collecting high-quality prompts. The collected prompts are converted into saliency heatmaps for training, and the peaks of the predicted heatmaps are then extracted as point prompts for SAM. Across nine datasets spanning four tasks, SamIC achieves state-of-the-art or competitive results with only 20% of the typical training data, notably excelling in domain-specific segmentation where prior prompt methods falter. The work highlights the potential of learned prompt engineering to leverage vision foundation models for scalable, data-efficient segmentation, while acknowledging limitations in medical and video domains tied to SAM's capabilities.

Abstract

Few-shot segmentation is the problem of learning to identify specific types of objects (e.g., airplanes) in images from a small set of labeled reference images. The current state of the art is driven by resource-intensive construction of models for every new domain-specific application. Such models must be trained on enormous labeled datasets of unrelated objects (e.g., cars, trains, animals) so that their "knowledge" can be transferred to new types of objects. In this paper, we show how to leverage existing vision foundation models (VFMs) to reduce the incremental cost of creating few-shot segmentation models for new domains. Specifically, we introduce SAMIC, a small network that learns how to prompt VFMs in order to segment new types of objects in domain-specific applications. SAMIC enables any task to be approached as a few-shot learning problem. At 2.6 million parameters, it is 94% smaller than the leading models (e.g., those with a ResNet-101 backbone and 45+ million parameters). Even when using only one fifth of the training data provided by one-shot benchmarks, SAMIC is competitive with, or sets, the state of the art on a variety of few-shot and semantic segmentation datasets including COCO-$20^i$, Pascal-$5^i$, PerSeg, FSS-1000, and NWPU VHR-10.
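
As a quick check of the size claim, the reduction can be recomputed from the two parameter counts quoted in the abstract. Taking 45 million as the baseline is an assumption here: the abstract says "45+", so the result is a lower bound on the reduction.

$$
1 - \frac{2.6 \times 10^{6}}{45 \times 10^{6}} \approx 1 - 0.058 = 0.942 \approx 94\%
$$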

Paper Structure

This paper contains 19 sections, 4 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Qualitative results of SamIC across diverse downstream segmentation tasks. SamIC has two components: the in-context spatial prompt engineering module, which predicts task-specific spatial prompts for target images (bottom row) by learning from a few in-context samples (top row), and the Segment Anything Model (SAM), which takes the spatial prompts as input to produce valid masks. SamIC unifies a diverse set of downstream segmentation tasks, including one-shot segmentation, semantic segmentation, video object segmentation, and domain-specific semantic segmentation.
  • Figure 2: Comparison of top prompt engineering methods (PerSAM [zhang2023personalize], Matcher [liu2023matcher]) for 1-shot airplane segmentation on a sample from NWPU VHR-10 [su2019object]. The reference image contains masks of airplanes. The test image has similar airplanes, but the orientation and background differ.
  • Figure 3: The SamBox annotation tool for rapid collection of unambiguous user prompts. With four user-provided spatial point prompts, SamBox outputs a mask with a saturated confidence score of 1.009.
  • Figure 4: Saliency-like 2D Gaussian heatmap generated from user-provided prompt with SamBox.
  • Figure 5: Overview of the SamIC architecture. HSNet [min2021hypercorrelation] is used as our in-context visual saliency prediction architecture to predict a saliency-like heatmap with 2D Gaussians, representing location priors for task-specific spatial point prompts. A peak-finding algorithm is used to extract a sequence of point prompts from the predicted heatmap, which are then provided to SAM to generate segmentation masks.
  • ...and 8 more figures
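
The captions for Figures 4 and 5 describe two steps that are easy to picture in code: rendering point prompts as a saliency-like heatmap of 2D Gaussians, and recovering point prompts from a heatmap by peak finding. The following is a minimal sketch of those two steps, not the authors' implementation. The function names `gaussian_heatmap` and `find_peaks`, the `sigma` and `threshold` values, the max-merging of overlapping Gaussians, and the 8-neighbor local-maximum rule are all assumptions made for illustration.

```python
import numpy as np


def gaussian_heatmap(points, shape, sigma=5.0):
    """Render one 2D Gaussian per (row, col) point.

    Overlapping Gaussians are merged with an element-wise max
    (an assumption; the paper's merging rule is not given here).
    """
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    heatmap = np.zeros(shape, dtype=np.float64)
    for r, c in points:
        dist_sq = (rows - r) ** 2 + (cols - c) ** 2
        blob = np.exp(-dist_sq / (2.0 * sigma ** 2))
        heatmap = np.maximum(heatmap, blob)
    return heatmap


def find_peaks(heatmap, threshold=0.5):
    """Return (row, col) pixels that are at least `threshold` and
    strictly greater than all 8 neighbors (hypothetical peak rule)."""
    h, w = heatmap.shape
    # Pad with -inf so border pixels can still qualify as peaks.
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    is_peak = heatmap >= threshold
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            neighbor = padded[1 + dr:1 + dr + h, 1 + dc:1 + dc + w]
            is_peak &= heatmap > neighbor
    return [(int(r), int(c)) for r, c in np.argwhere(is_peak)]


if __name__ == "__main__":
    prompts = [(10, 12), (30, 40)]  # hypothetical user clicks (row, col)
    heatmap = gaussian_heatmap(prompts, shape=(48, 64), sigma=3.0)
    print(find_peaks(heatmap))
```

In the pipeline described by Figure 5, the heatmap would come from the saliency predictor rather than from known clicks, and the recovered peaks would be handed to SAM as point prompts. The strict local-maximum rule is the simplest choice; a predicted heatmap with flat plateaus or noise would likely need smoothing or non-maximum suppression first.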