SAM-CP: Marrying SAM with Composable Prompts for Versatile Segmentation

Pengfei Chen, Lingxi Xie, Xinyue Huo, Xuehui Yu, Xiaopeng Zhang, Yingfei Sun, Zhenjun Han, Qi Tian

TL;DR

SAM-CP introduces two composable prompts (Prompt I for semantic labeling and Prompt II for instance merging) paired with a unified affinity framework to convert SAM patches into semantic regions and instances. The approach uses semantic and instance queries, a patch encoder, and a dynamic affinity matrix $\boldsymbol{A}$ to propagate high-affinity information via cross-attention, with DCA, AR, and QE refinements guiding patch merging. Semantic-level supervision via CLIP-informed logits and instance-level Hungarian matching, combined into a single loss $\text{L}_{\text{all}}$, enables effective open-vocabulary segmentation alongside traditional closed-domain tasks. Empirically, SAM-CP achieves state-of-the-art open-vocabulary panoptic/semantic/instance segmentation on COCO/ADE20K/Cityscapes and competitive closed-domain results, demonstrating a general, modular method to imbue vision foundation models with multi-grained semantic perception.

Abstract

The Segment Anything model (SAM) has shown a generalized ability to group image pixels into patches, but applying it to semantic-aware segmentation still faces major challenges. This paper presents SAM-CP, a simple approach that establishes two types of composable prompts beyond SAM and composes them for versatile segmentation. Specifically, given a set of classes (in texts) and a set of SAM patches, the Type-I prompt judges whether a SAM patch aligns with a text label, and the Type-II prompt judges whether two SAM patches with the same text label also belong to the same instance. To decrease the complexity in dealing with a large number of semantic classes and patches, we establish a unified framework that calculates the affinity between (semantic and instance) queries and SAM patches and merges patches with high affinity to the query. Experiments show that SAM-CP achieves semantic, instance, and panoptic segmentation in both open and closed domains. In particular, it achieves state-of-the-art performance in open-vocabulary segmentation. Our research offers a novel and generalized methodology for equipping vision foundation models like SAM with multi-grained semantic perception abilities.
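
The affinity-and-merge step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `merge_patches_by_affinity`, the fixed threshold of 0.5, the representation of patches as sets of pixel coordinates, and the hand-written affinity values are all hypothetical. In SAM-CP the affinities are learned from queries and patch features rather than supplied by hand.

```python
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]


def merge_patches_by_affinity(
    affinity: List[List[float]],
    patches: List[Set[Pixel]],
    threshold: float = 0.5,
) -> Dict[int, Set[Pixel]]:
    """Union all patches whose affinity to a query exceeds `threshold`.

    affinity[q][p] is a (hypothetical) affinity score between query q and
    patch p. Returns a mapping from query index to the merged pixel set;
    queries that attract no patch are omitted.
    """
    merged: Dict[int, Set[Pixel]] = {}
    for q, row in enumerate(affinity):
        if len(row) != len(patches):
            raise ValueError("each affinity row needs one entry per patch")
        region: Set[Pixel] = set()
        for p, score in enumerate(row):
            if score > threshold:
                region |= patches[p]
        if region:
            merged[q] = region
    return merged


# Toy example: three patches and two queries (values are made up).
patches = [{(0, 0), (0, 1)}, {(1, 0)}, {(1, 1)}]
affinity = [
    [0.9, 0.8, 0.1],  # query 0 attracts patches 0 and 1
    [0.2, 0.1, 0.7],  # query 1 attracts patch 2
]
regions = merge_patches_by_affinity(affinity, patches)
```

In this toy setup a query can stand for either a semantic class (Type-I) or an instance (Type-II); the same merge rule applies to both, which is the sense in which the framework is unified.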

Paper Structure

This paper contains 33 sections, 2 equations, 14 figures, 15 tables, 1 algorithm.

Figures (14)

  • Figure 1: An illustration of how SAM-CP works at the idea level. Given an image and the patches produced by SAM, we first execute Prompt I to find the patches corresponding to any text label (in either closed or open domains), and then, if necessary, execute Prompt II to group the patches within each class into instances. In the upper part, the height of each bar corresponds to the probability that a patch belongs to a text label (yellow, green, blue and red for 'sand', 'kite', 'sky' and 'person'); in the lower part, two patches are connected by a solid line if they belong to the same instance (purple for 'person-1' and orange for 'person-2'). This figure is best viewed in color.
  • Figure 2: The unified affinity framework as an efficient implementation of SAM-CP. The input image with SAM patches is fed into a patch encoder. Type-I and Type-II prompts appear as two sets of queries. Affinity values are computed, and the SAM patches are merged according to them. Semantic-level and instance-level supervision is applied to the merged patches. The purple arrows are present only in the inference stage of open-vocabulary segmentation. Best viewed in color.
  • Figure 3: A qualitative study of how SAM-CP works. Each row displays an example. The leftmost column shows the input image with SAM patches; the middle and right parts show the semantic and instance segmentation results, respectively. We use the t-SNE algorithm to project the visual features learned by SAM-CP into a 2D coordinate system (please refer to Figure 4 for the difference from the features of SAM). Points with the same color belong to the same semantic class or the same instance (according to the ground truth). This figure is best viewed in color.
  • Figure 4: The t-SNE visualization of the visual features of SAM and SAM-CP. Due to limited space, only semantic segmentation results are displayed. Points with the same color belong to the same semantic class (according to the ground truth). This figure is best viewed in color.
  • Figure 5: Direct classification for parts and 'parts of the whole'.
  • ...and 9 more figures