Learning to Prompt Segment Anything Models

Jiaxing Huang, Kai Jiang, Jingyi Zhang, Han Qiu, Lewei Lu, Shijian Lu, Eric Xing

TL;DR

This work tackles learning prompts for Segment Anything Models (SAMs) by identifying two core challenges: a limited spatial prompt search space and potential side effects from pretrained text prompt encoders. It introduces SSPrompt, comprising SpaPrompt and SemPrompt, which optimize spatial and semantic prompts directly in embedding space and selectively leverage prompt-encoder knowledge, enabling effective few-shot adaptation of SAMs. Extensive experiments across six segmentation datasets and two backbones demonstrate consistent gains in semantic, instance, and panoptic tasks, including robustness under adverse conditions and improved training efficiency. The results indicate that embedding-space prompt learning with selective encoder usage is a practical and scalable path to task- and domain-specific SAM deployment.

Abstract

Segment Anything Models (SAMs) like SEEM and SAM have demonstrated great potential in learning to segment anything. The core design of SAMs lies with Promptable Segmentation, which takes a handcrafted prompt as input and returns the expected segmentation mask. SAMs work with two types of prompts including spatial prompts (e.g., points) and semantic prompts (e.g., texts), which work together to prompt SAMs to segment anything on downstream datasets. Despite the important role of prompts, how to acquire suitable prompts for SAMs is largely under-explored. In this work, we examine the architecture of SAMs and identify two challenges for learning effective prompts for SAMs. To this end, we propose spatial-semantic prompt learning (SSPrompt) that learns effective semantic and spatial prompts for better SAMs. Specifically, SSPrompt introduces spatial prompt learning and semantic prompt learning, which optimize spatial prompts and semantic prompts directly over the embedding space and selectively leverage the knowledge encoded in pre-trained prompt encoders. Extensive experiments show that SSPrompt achieves superior image segmentation performance consistently across multiple widely adopted datasets.

Paper Structure

This paper contains 15 sections, 8 equations, 3 figures, 9 tables.

Figures (3)

  • Figure 1: The architecture of Segment Anything Models (SAMs). SAMs [kirillov2023segment, zou2023segment] consist of three core parts: (1) a large Image Encoder that encodes input images into image embeddings; (2) prompt encoders, including a large Text Prompt Encoder that encodes text tokens into text prompt embeddings and a lightweight Spatial Prompt Encoder that encodes 2D spatial coordinates into spatial prompt embeddings; and (3) a lightweight Mask Decoder that predicts the expected segmentation masks based on the image and prompt embeddings.
  • Figure 2: The framework of spatial-semantic prompt learning (SSPrompt). SSPrompt optimizes spatial and semantic prompts directly in the embedding space and selectively leverages the knowledge encoded in the prompt encoders: it applies learnable weights to the default prompt embeddings ($\{{z}^{S}_{n}\}_{n=1}^{N}$ and $\{{z}^{T}_{c}\}_{c=1}^{C}$) and fuses the weighted embeddings with the learnable prompt embeddings ($\{\hat{z}^{S}_{n}\}_{n=1}^{N}$ and $\{\hat{z}^{T}_{c}\}_{c=1}^{C}$) to obtain new prompts. During training, only the Learnable Weights and the Learnable Prompt Embeddings are updated (marked by Flame), while all other components are frozen (marked by Snowflake).
  • Figure 3: (a) Text data statistics (used for text prompt encoder pre-training in SAMs [zou2023segment, kirillov2023segment]). (b) Learnt weights in semantic prompt learning.
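
The fusion described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes one scalar weight per prompt and additive fusion, i.e. new prompt = weight × default embedding + learnable embedding. The caption does not give the exact fusion rule, so this form, the function name `fuse_prompts`, and the toy values are illustrative assumptions rather than the authors' implementation.

```python
from typing import List

Vector = List[float]


def fuse_prompts(default_embeds: List[Vector],
                 learnable_embeds: List[Vector],
                 weights: List[float]) -> List[Vector]:
    """Weight each default (frozen-encoder) embedding, then add the learnable one.

    Assumption: scalar weight per prompt and additive fusion; the source
    only says the weighted embeddings are "fused" with the learnable ones.
    """
    if not (len(default_embeds) == len(learnable_embeds) == len(weights)):
        raise ValueError("expected one weight and one learnable embedding per prompt")
    fused = []
    for z, z_hat, w in zip(default_embeds, learnable_embeds, weights):
        if len(z) != len(z_hat):
            raise ValueError("embedding dimensions must match")
        fused.append([w * a + b for a, b in zip(z, z_hat)])
    return fused


# Toy example: two prompts with 3-dimensional embeddings.
default_embeds = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # outputs of a frozen prompt encoder
learnable_embeds = [[0.5, 0.5, 0.5], [0.0, 0.0, 0.0]]  # trainable embeddings
weights = [0.0, 1.0]                                   # trainable weights

fused = fuse_prompts(default_embeds, learnable_embeds, weights)
```

In this toy setup a weight of 0 discards the encoder's embedding, so the first prompt is purely learned, while a weight of 1 with a zero learnable embedding reproduces the encoder output unchanged. Intermediate weights would interpolate between ignoring and fully trusting the pre-trained prompt encoder, which is one way to read "selectively leverages".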