EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model

Yuxuan Zhang, Tianheng Cheng, Lianghui Zhu, Rui Hu, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang

TL;DR

The paper addresses how to add text prompts to the Segment Anything Model (SAM) so that it can perform Referring Expression Segmentation (RES). It introduces EVF-SAM, which uses an early vision-language fusion encoder (BEIT-3) to turn an image-plus-text prompt into prompt embeddings for SAM while keeping the SAM image encoder frozen. Trained on a hybrid dataset with targeted data strategies (e.g., a [semantic] token), EVF-SAM achieves state-of-the-art average cIoU on RefCOCO/+/g with roughly 1.32B parameters, far fewer than large LLM-based methods. The results, ablations, and efficiency analyses indicate that early fusion with multimodal prompts is a practical, scalable route to text-guided segmentation, and that it extends RES to semantic- and part-level tasks.
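
The cIoU metric mentioned above is commonly defined in the referring segmentation literature as cumulative intersection over cumulative union across the whole evaluation set, rather than a mean of per-image IoU scores. The sketch below illustrates that definition on toy binary masks. The function name `cumulative_iou` and the nested-list mask format are assumptions made for illustration; this is not the paper's evaluation code.

```python
def cumulative_iou(pred_masks, gt_masks):
    """Cumulative IoU over a dataset of binary masks.

    Each mask is a nested list of 0/1 values (hypothetical format).
    Intersections and unions are summed over all samples first,
    and the ratio is taken once at the end.
    """
    total_inter = 0
    total_union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        for pred_row, gt_row in zip(pred, gt):
            for p, g in zip(pred_row, gt_row):
                total_inter += int(bool(p) and bool(g))
                total_union += int(bool(p) or bool(g))
    # Guard against an empty dataset or all-empty masks.
    return total_inter / total_union if total_union else 0.0


# Two toy samples: per-image IoU is 1/2 and 3/4,
# while the cumulative ratio is (1 + 3) / (2 + 4).
preds = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
gts = [[[1, 0], [0, 0]], [[1, 1], [1, 0]]]
score = cumulative_iou(preds, gts)
```

Because the sums are pooled before dividing, samples with large objects contribute more to cIoU than samples with small ones, which is why it can differ from the mean per-image IoU (0.625 in this toy case).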

Abstract

Segment Anything Model (SAM) has attracted widespread attention for its superior interactive segmentation capabilities with visual prompts while lacking further exploration of text prompts. In this paper, we empirically investigate what text prompt encoders (e.g., CLIP or LLM) are good for adapting SAM for referring expression segmentation and introduce the Early Vision-language Fusion-based SAM (EVF-SAM). EVF-SAM is a simple yet effective referring segmentation method which exploits multimodal prompts (i.e., image and text) and comprises a pre-trained vision-language model to generate referring prompts and a SAM model for segmentation. Surprisingly, we observe that: (1) multimodal prompts and (2) vision-language models with early fusion (e.g., BEIT-3) are beneficial for prompting SAM for accurate referring segmentation. Our experiments show that the proposed EVF-SAM based on BEIT-3 can obtain state-of-the-art performance on RefCOCO/+/g for referring expression segmentation and demonstrate the superiority of prompting SAM with early vision-language fusion. In addition, the proposed EVF-SAM with 1.32B parameters achieves remarkably higher performance while reducing nearly 82% of parameters compared to previous SAM methods based on large multimodal models.
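
The parameter claim in the abstract can be sanity-checked with simple arithmetic. The sketch below only restates the two numbers given above (1.32B parameters, a reduction of nearly 82%) and derives the baseline size they jointly imply. The abstract does not state the baseline's parameter count, so the derived figure is an inference from rounded numbers, not a reported result; the function names are hypothetical.

```python
def reduction_fraction(ours_billion, baseline_billion):
    """Fraction of parameters saved relative to a baseline."""
    return 1.0 - ours_billion / baseline_billion


def implied_baseline(ours_billion, reduction):
    """Baseline size implied by our size and a stated reduction fraction."""
    return ours_billion / (1.0 - reduction)


# Numbers taken from the abstract; "nearly 82%" is treated as exactly 0.82,
# which is an approximation.
evf_sam_params = 1.32
baseline_params = implied_baseline(evf_sam_params, 0.82)  # roughly 7.3 (billions)
check = reduction_fraction(evf_sam_params, baseline_params)
```

An implied baseline of roughly 7B parameters is consistent with the abstract's description of the compared methods as built on large multimodal models.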

Paper Structure

This paper contains 26 sections, 10 figures, and 11 tables.

Figures (10)

  • Figure 1: EVF-SAM achieves competitive performance across various benchmarks for referring expression segmentation.
  • Figure 2: Comparison between late fusion and early fusion. We visualize the attention maps of a representative late-fusion method (i.e., LISA [lisa]) and an early-fusion method (i.e., EVF-SAM) across different layers. The Y axis represents the 'Query' and the X axis represents the 'Key'. We use red, blue, and orange boxes to highlight the 'image-to-text attention', the 'text-to-image attention', and the targeted segmentation object, respectively. The figures demonstrate that the 'text-to-image attention' is more crucial for the Referring Expression Segmentation task, while late-fusion methods ignore it. To compensate for this, we propose the early-fusion method for high-quality text-guided visual features.
  • Figure 3: Comparisons of different text-prompted SAM designs. (a) A natural idea to support text prompts is to use an off-the-shelf text encoder to generate text embeddings for SAM [sam, refsam]. (b) Several works [lisa, lisa++, glamm] adopt Large Language Models (LLMs) to generate prompt embeddings for SAM in an auto-regressive manner. (c) Our proposed EVF-SAM exploits an effective early vision-language fusion encoder for text-prompted SAM with higher performance and fewer parameters.
  • Figure 4: Architectural explorations for text-prompted SAM. 'L' and 'V' denote the text encoder and vision encoder. We mainly explore three schemes: (a) a vanilla baseline with a simple text encoder, (b) multimodal inputs with late fusion, i.e., concatenation, and (c) multimodal inputs with early fusion.
  • Figure 5: The overall architecture of EVF-SAM. The proposed EVF-SAM maintains the original architecture of SAM and keeps the weights of the SAM Image Encoder frozen. EVF-SAM exploits the Multimodal Encoder with Early Vision-Language Fusion (EVF) to encode both text prompts and the low-resolution input image (which is resized to $224\times224$). Then the output [CLS] token is projected as prompt embeddings and fed into the prompt encoder of SAM for generating the referring segmentation results.
  • ...and 5 more figures
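
The data flow described in the Figure 5 caption (low-resolution image plus text into a multimodal encoder, [CLS] token projected to a prompt embedding, prompt embedding passed to SAM) can be sketched as below. This is a minimal illustration in plain Python, not the authors' implementation. Everything here is an assumption: the encoder is a hypothetical stub that returns a placeholder vector instead of running attention, the projector is a single untrained linear layer, and the widths `FUSED_DIM` and `PROMPT_DIM` are illustrative values.

```python
import random

FUSED_DIM = 1024  # assumption: hidden width of the multimodal encoder
PROMPT_DIM = 256  # assumption: width expected for SAM prompt embeddings


def stub_multimodal_encoder(image_224, text_tokens, dim=FUSED_DIM, seed=0):
    """Hypothetical stand-in for the early-fusion encoder.

    A real encoder would run joint self-attention over image patches and
    text tokens. This stub only returns a deterministic placeholder for
    the output [CLS] vector so the shapes can be followed.
    """
    rng = random.Random(seed + len(image_224) + len(text_tokens))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def make_projector(in_dim=FUSED_DIM, out_dim=PROMPT_DIM, seed=1):
    """Build a random, untrained linear projection (illustration only)."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim

    def project(vec):
        return [sum(w * x for w, x in zip(row, vec)) + b
                for row, b in zip(weight, bias)]

    return project


def build_prompt_embedding(image_224, text_tokens, project):
    """Return one prompt token of width PROMPT_DIM for the SAM side."""
    cls_token = stub_multimodal_encoder(image_224, text_tokens)
    # A list holding a single prompt embedding: shape (1, PROMPT_DIM).
    return [project(cls_token)]


project = make_projector()
image = [[0.0] * 224 for _ in range(224)]  # placeholder for the resized image
prompt = build_prompt_embedding(image, ["the", "left", "zebra"], project)
```

In a full pipeline the resulting prompt embedding would be handed to SAM alongside the features from the frozen SAM image encoder, which processes the image separately from the 224×224 copy used here.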