SAM-CP: Marrying SAM with Composable Prompts for Versatile Segmentation
Pengfei Chen, Lingxi Xie, Xinyue Huo, Xuehui Yu, Xiaopeng Zhang, Yingfei Sun, Zhenjun Han, Qi Tian
TL;DR
SAM-CP introduces two composable prompts, Prompt I for semantic labeling and Prompt II for instance merging, paired with a unified affinity framework that converts SAM patches into semantic regions and instances. The approach uses semantic and instance queries, a patch encoder, and a dynamic affinity matrix $\boldsymbol{A}$ to propagate high-affinity information via cross-attention, with DCA, AR, and QE refinements guiding patch merging. Semantic-level supervision via CLIP-informed logits and instance-level Hungarian matching, combined into a single loss $\text{L}_{\text{all}}$, enables effective open-vocabulary segmentation alongside traditional closed-domain tasks. Empirically, SAM-CP achieves state-of-the-art open-vocabulary panoptic, semantic, and instance segmentation on COCO, ADE20K, and Cityscapes, along with competitive closed-domain results, demonstrating a general, modular method to imbue vision foundation models with multi-grained semantic perception.
Abstract
The Segment Anything model (SAM) has shown a generalized ability to group image pixels into patches, but applying it to semantic-aware segmentation still faces major challenges. This paper presents SAM-CP, a simple approach that establishes two types of composable prompts beyond SAM and composes them for versatile segmentation. Specifically, given a set of classes (in texts) and a set of SAM patches, the Type-I prompt judges whether a SAM patch aligns with a text label, and the Type-II prompt judges whether two SAM patches with the same text label also belong to the same instance. To decrease the complexity in dealing with a large number of semantic classes and patches, we establish a unified framework that calculates the affinity between (semantic and instance) queries and SAM patches and merges patches with high affinity to the query. Experiments show that SAM-CP achieves semantic, instance, and panoptic segmentation in both open and closed domains. In particular, it achieves state-of-the-art performance in open-vocabulary segmentation. Our research offers a novel and generalized methodology for equipping vision foundation models like SAM with multi-grained semantic perception abilities.
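The affinity-and-merge idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: in SAM-CP the affinity is learned from queries and a patch encoder, whereas this sketch assumes a plain cosine similarity between hand-written feature vectors, a fixed threshold of 0.5, and patch masks stored as sets of pixel indices. The names `cosine`, `affinity_matrix`, and `merge_patches` are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def affinity_matrix(queries, patch_feats):
    """A[i][j] is the affinity between query i and patch j.

    Assumption: cosine similarity stands in for the learned affinity.
    """
    return [[cosine(q, p) for p in patch_feats] for q in queries]


def merge_patches(affinity, patch_masks, threshold=0.5):
    """For each query, union the pixel sets of all patches whose
    affinity to that query exceeds the (hypothetical) threshold."""
    merged = []
    for row in affinity:
        region = set()
        for a, mask in zip(row, patch_masks):
            if a > threshold:
                region |= mask
        merged.append(region)
    return merged


# Toy data: two queries, three patches with 2-D features.
queries = [[1.0, 0.0], [0.0, 1.0]]
patch_feats = [[0.9, 0.1], [0.8, 0.3], [0.1, 1.0]]
patch_masks = [{0, 1}, {2, 3}, {4, 5}]  # pixel indices covered by each patch

A = affinity_matrix(queries, patch_feats)
regions = merge_patches(A, patch_masks)
print(regions)
```

In this toy setup the first query absorbs the first two patches and the second query absorbs the third. Each query could stand for either a semantic class or a single instance, which mirrors the way the unified framework treats both kinds of query against the same set of patches.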
