Vision and Language Reference Prompt into SAM for Few-shot Segmentation

Kosuke Sakurai, Ryotaro Shimizu, Masayuki Goto

TL;DR

This work addresses SAM's dependence on per-image prompts and its lack of class labels by introducing VLP-SAM, whose lightweight vision-language prompt encoder fuses a target image, an annotated reference image, and a text label through a multimodal vision-language model to produce SAM prompt embeddings. The VLP encoder combines pixel-text matching with visual and textual prototypes, augments them with pseudo and attention masks, and uses Transformer-based attention to generate prompts for SAM's mask decoder. On PASCAL-$5^i$ and COCO-$20^i$, VLP-SAM with CLIP Surgery outperforms VRP-SAM by 6.3% and 9.5% mIoU, respectively, and it generalizes to unseen objects, which indicates open-vocabulary few-shot segmentation capability. The approach is scalable, adds few learnable parameters on top of large foundation models, and its code is publicly released.

Abstract

The Segment Anything Model (SAM) is a large-scale segmentation model that offers powerful zero-shot capabilities with flexible prompts. While SAM can segment any object in a zero-shot manner, it requires user-provided prompts for each target image and does not attach any label information to its masks. Few-shot segmentation models address these issues by feeding annotated reference images to SAM as prompts, which lets them segment specific objects in target images without user-provided prompts. However, previous SAM-based few-shot segmentation models use only annotated reference images as prompts, and their accuracy is limited by this lack of reference information. In this paper, we propose a novel few-shot segmentation model, Vision and Language reference Prompt into SAM (VLP-SAM), that utilizes both the visual information of the reference images and the semantic information of the text labels by taking language as well as images as reference information. In particular, VLP-SAM has a simple and scalable structure with minimal learnable parameters, and it feeds prompt embeddings carrying vision-language information into SAM using a multimodal vision-language model. To demonstrate the effectiveness of VLP-SAM, we conducted experiments on the PASCAL-$5^i$ and COCO-$20^i$ datasets and achieved high performance in the few-shot segmentation task, outperforming the previous state-of-the-art model by a large margin (6.3% and 9.5% in mIoU, respectively). Furthermore, VLP-SAM generalizes to unseen objects that are not included in the training data. Our code is available at https://github.com/kosukesakurai1/VLP-SAM.

Paper Structure

This paper contains 13 sections, 4 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Comparison of our proposed method VLP-SAM and three previous models. (a) SAM is a category-agnostic interactive zero-shot segmentation model that inputs user-provided prompts for each target image. (b) PerSAM is a training-free SAM-based few-shot segmentation model that inputs positive and negative points on the target image into SAM, without requiring user-provided prompts. (c) VRP-SAM is a SAM-based few-shot segmentation model incorporating a VRP encoder to generate prompt embeddings for SAM from an annotated reference image. (d) Our proposed method, VLP-SAM, introduces a novel prompt encoder for SAM, called the VLP encoder, which accepts both reference images and text labels as input via a vision-language model.
  • Figure 2: Overview of our proposed method VLP-SAM. VLP-SAM introduces a novel prompt encoder for SAM, called the VLP encoder, which generates the prompt embedding $\bm{Q}'_t$ with vision-language information. In particular, the VLP encoder first embeds the target image $I_t$, the reference image $I_r$, and the text label into the unified embedding space using the VLM encoder. These embeddings are then concatenated with prototypes and masks that aggregate information about the target object and the text label. Finally, the prompt embedding $\bm{Q}'_t$, refined through Transformer-based attention, is fed into SAM's mask decoder, enabling VLP-SAM to produce a more accurate mask of the target object.
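
The data flow described in the Figure 2 caption (prototypes and masks concatenated with the embeddings before attention) can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the summary does not give the exact formulas, so masked average pooling for the visual prototype, cosine similarity for pixel-text matching, and a fixed threshold for the pseudo mask are all assumptions, and every function name below is hypothetical. The VLM encoder, the attention mask, the Transformer-based attention, and SAM's mask decoder are omitted; feature maps are flattened to lists of per-pixel vectors.

```python
import math

def masked_average_pool(features, mask):
    """Average the feature vectors of pixels where mask == 1 (assumed visual prototype)."""
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("mask selects no pixels")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def pixel_text_matching(features, text_embedding):
    """Similarity of every pixel embedding to the text-label embedding."""
    return [cosine(f, text_embedding) for f in features]

def pseudo_mask(similarities, threshold=0.5):
    """Binarize the similarity map (the threshold is an illustrative assumption)."""
    return [1 if s > threshold else 0 for s in similarities]

def build_encoder_input(target_features, prototype, text_embedding, pseudo):
    """Concatenate each target pixel with the visual prototype, the text
    embedding (standing in for the textual prototype), and its pseudo-mask value."""
    return [f + prototype + text_embedding + [float(p)]
            for f, p in zip(target_features, pseudo)]

# Toy 2-D embeddings standing in for VLM encoder outputs.
reference_features = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
reference_mask = [1, 1, 0]            # annotation of the reference image
text_embedding = [1.0, 0.0]           # embedding of the text label
target_features = [[0.9, 0.1], [0.1, 0.9]]

prototype = masked_average_pool(reference_features, reference_mask)
similarities = pixel_text_matching(target_features, text_embedding)
pseudo = pseudo_mask(similarities)
tokens = build_encoder_input(target_features, prototype, text_embedding, pseudo)
```

In this toy setup the first target pixel points in nearly the same direction as the text embedding and is kept by the pseudo mask, while the second is rejected. Each resulting token would then be refined by attention to form the prompt embedding $\bm{Q}'_t$, a step this sketch does not attempt.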