SSP-SAM: SAM with Semantic-Spatial Prompt for Referring Expression Segmentation

Wei Tang, Xuejing Liu, Yanpeng Sun, Zechao Li

Abstract

The Segment Anything Model (SAM) excels at general image segmentation but has limited ability to understand natural language, which restricts its direct application in Referring Expression Segmentation (RES). To this end, we propose SSP-SAM, a framework that fully utilizes SAM's segmentation capabilities by integrating a Semantic-Spatial Prompt (SSP) encoder. Specifically, we incorporate both visual and linguistic attention adapters into the SSP encoder, which highlight salient objects within the visual features and discriminative phrases within the linguistic features. This design enhances the referent representation for the prompt generator, resulting in high-quality SSPs that enable SAM to generate precise masks guided by language. Although not specifically designed for Generalized RES (GRES), where the referent may correspond to zero, one, or multiple objects, SSP-SAM naturally supports this more flexible setting without additional modifications. Extensive experiments on widely used RES and GRES benchmarks confirm the superiority of our method. Notably, our approach generates segmentation masks of high quality, achieving strong precision even at strict thresholds such as Pr@0.9. Further evaluation on the PhraseCut dataset demonstrates improved performance in open-vocabulary scenarios compared to existing state-of-the-art RES methods. The code and checkpoints are available at: https://github.com/WayneTomas/SSP-SAM.
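
For readers unfamiliar with the Pr@0.9 metric mentioned in the abstract, the following is a minimal sketch of how precision at an IoU threshold is commonly computed in RES evaluation: the fraction of samples whose predicted mask overlaps the ground truth with an IoU at or above the threshold. It is not taken from the SSP-SAM code base. The function names `mask_iou` and `precision_at` are hypothetical, masks are assumed to be equally sized nested lists of 0/1, the comparison is assumed to be inclusive (`>=`), and two empty masks are assumed to count as a perfect match.

```python
def mask_iou(pred, gt):
    """IoU between two binary masks given as equally sized nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumption of this sketch: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def precision_at(pred_masks, gt_masks, threshold=0.9):
    """Fraction of samples whose mask IoU is at least `threshold`."""
    ious = [mask_iou(p, g) for p, g in zip(pred_masks, gt_masks)]
    # Assumption: the comparison is inclusive; some evaluation scripts use '>'.
    return sum(iou >= threshold for iou in ious) / len(ious)


# Toy example: one exact prediction and one that covers half of the target.
gt = [[1, 1], [1, 1]]
pred_exact = [[1, 1], [1, 1]]
pred_half = [[1, 1], [0, 0]]
score_strict = precision_at([pred_exact, pred_half], [gt, gt], threshold=0.9)
score_loose = precision_at([pred_exact, pred_half], [gt, gt], threshold=0.5)
```

In this toy example only the exact prediction passes the 0.9 threshold, while both pass at 0.5, which illustrates why high scores at strict thresholds indicate precise mask boundaries.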

Paper Structure (33 sections, 14 equations, 9 figures, 16 tables)

This paper contains 33 sections, 14 equations, 9 figures, 16 tables.

Figures (9)

  • Figure 1: (a) SAM struggles with text prompts, especially the free-form language used in Referring Expression Segmentation, resulting in poor segmentation masks. (b) SSP-SAM seamlessly transforms images and language into Semantic-Spatial Prompts that guide SAM in segmenting the referent.
  • Figure 2: Illustration of referring expression segmentation (RES) settings. (a) Classic RES focuses on single-target segmentation. (b) and (c) Generalized RES (GRES) introduces multi-target and no-target scenarios, which require not only a comprehensive understanding of all objects in the scene but also a contextual interpretation of both the image and the referring expression.
  • Figure 3: An illustration of SSP-SAM, where SAM is equipped with a Semantic-Spatial Prompt (SSP) encoder for RES. The multi-modal features of the inputs are extracted by CLIP, and their attention scores are adapted to generate Semantic-Spatial referent features. These features, along with the special tokens, are processed by a prompt generator to identify key information about the referent for the segmentation process. Additionally, an auxiliary REC task is used to improve RES performance.
  • Figure 4: An illustration of the attention adapter. It consists of a visual attention adapter and a linguistic attention adapter. The image and text embeddings are extracted from CLIP and then fed to the visual and linguistic attention adapters to obtain the enhanced referent features.
  • Figure 5: Qualitative segmentation results of SSP-SAM on classic RES datasets (RefCOCO, RefCOCO+, RefCOCOg, ReferIt) and the open-vocabulary PhraseCut dataset. SSP-SAM generates masks with sharper boundaries and finer details. Predicted masks are shown in red, and ground truth in green.
  • ...and 4 more figures