RS2-SAM2: Customized SAM2 for Referring Remote Sensing Image Segmentation

Fu Rong, Meng Lan, Qian Zhang, Lefei Zhang

TL;DR

RS2-SAM2 addresses the challenge of text-guided segmentation in remote sensing by adapting SAM2 through a union encoder that jointly encodes visual and textual inputs, a bidirectional hierarchical fusion module that aligns RS features with textual semantics at multiple scales, and a mask prompt generator that produces dense pseudo-masks to guide SAM2. The approach introduces a text-guided boundary loss and leverages multimodal tokens to produce pixel-level prompts, enabling precise delineation of RS targets. Empirical results on RefSegRS and RRSIS-D show state-of-the-art performance across Pr, mIoU, and oIoU, validating the effectiveness of the dense prompts and hierarchical fusion in handling RS-specific challenges. Overall, RS2-SAM2 offers a robust, end-to-end framework for accurate, text-conditioned RS segmentation with strong generalization to complex remote sensing scenes.
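The three metrics named above are commonly computed as follows in referring segmentation work; the sketch below is an illustration of those usual definitions, not the authors' evaluation code. The function names `iou_parts` and `evaluate`, the threshold values, the use of "at least" for the precision threshold, and the convention that two empty masks count as a perfect match are all assumptions made for this example.

```python
def iou_parts(pred, gt):
    """Return (intersection, union) pixel counts for two binary masks.

    Masks are nested lists of 0/1 values with identical shapes.
    """
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def evaluate(preds, gts, thresholds=(0.5, 0.7, 0.9)):
    """Compute mIoU, oIoU, and Pr@threshold over a list of mask pairs."""
    per_sample_iou = []
    total_inter = total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = iou_parts(pred, gt)
        # Assumed convention: two empty masks count as a perfect match.
        per_sample_iou.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    n = len(per_sample_iou)
    return {
        # mIoU: average of the per-sample IoU values.
        "mIoU": sum(per_sample_iou) / n,
        # oIoU: pooled intersection over pooled union, so large targets weigh more.
        "oIoU": total_inter / total_union if total_union else 1.0,
        # Pr@t: fraction of samples whose IoU reaches the threshold t.
        "Pr": {t: sum(iou >= t for iou in per_sample_iou) / n for t in thresholds},
    }


if __name__ == "__main__":
    preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
    gts = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
    print(evaluate(preds, gts))
```

Because oIoU pools pixels across the whole test set while mIoU averages per image, the two can diverge noticeably when target sizes vary, which is typical of remote sensing scenes.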

Abstract

Referring Remote Sensing Image Segmentation (RRSIS) aims to segment target objects in remote sensing (RS) images based on textual descriptions. Although Segment Anything Model 2 (SAM2) has shown remarkable performance in various segmentation tasks, its application to RRSIS presents several challenges, including understanding the text-described RS scenes and generating effective prompts from text. To address these issues, we propose RS2-SAM2, a novel framework that adapts SAM2 to RRSIS by aligning the adapted RS features and textual features while providing pseudo-mask-based dense prompts. Specifically, we employ a union encoder to jointly encode the visual and textual inputs, generating aligned visual and text embeddings as well as multimodal class tokens. A bidirectional hierarchical fusion module is introduced to adapt SAM2 to RS scenes and align adapted visual features with the visually enhanced text embeddings, improving the model's interpretation of text-described RS scenes. To provide precise target cues for SAM2, we design a mask prompt generator, which takes the visual embeddings and class tokens as input and produces a pseudo-mask as the dense prompt of SAM2. Experimental results on several RRSIS benchmarks demonstrate that RS2-SAM2 achieves state-of-the-art performance.
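To make the idea of a pseudo-mask dense prompt concrete, here is a minimal sketch of one way per-pixel visual embeddings and a class token could be turned into a soft mask: a scaled dot product followed by a sigmoid. The abstract does not specify the internals of the mask prompt generator, so this scoring rule, the function name `pseudo_mask`, and the data layout are assumptions for illustration only and should not be read as the authors' design.

```python
import math


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def pseudo_mask(visual, class_token):
    """Hypothetical dense-prompt sketch (not the paper's generator).

    visual:      H x W grid of C-dimensional embeddings (nested lists).
    class_token: C-dimensional multimodal token (list of floats).
    Returns an H x W grid of values in (0, 1); higher means the pixel
    embedding is more similar to the class token.
    """
    scale = 1.0 / math.sqrt(len(class_token))
    mask = []
    for row in visual:
        mask_row = []
        for pixel in row:
            # Scaled dot-product similarity between pixel and token.
            score = scale * sum(v * t for v, t in zip(pixel, class_token))
            mask_row.append(_sigmoid(score))
        mask.append(mask_row)
    return mask


if __name__ == "__main__":
    visual = [[[2.0, 0.0], [-2.0, 0.0]]]  # a 1 x 2 image with 2-dim embeddings
    token = [1.0, 0.0]
    print(pseudo_mask(visual, token))
```

In a setup like this, the resulting soft mask would be resized to the resolution the segmentation model expects for mask prompts and passed alongside the image features, which is the role the abstract assigns to the dense prompt.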

Paper Structure

This paper contains 15 sections, 6 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Comparison of two SAM2 adaptations for RRSIS. (a) vanilla SAM2, (b) our RS2-SAM2.
  • Figure 2: The overview of the proposed RS2-SAM2 framework. It consists of four key components: the union encoder, the bidirectional hierarchical fusion module, the mask prompt generator, and SAM2. The union encoder extracts multimodal representations from the input image and text. The bidirectional hierarchical fusion module enhances image features with textual embeddings. The mask prompt generator produces a prior mask as the dense prompt for SAM2. Finally, SAM2 generates precise masks, while the text-guided boundary loss constrains their boundary accuracy.
  • Figure 3: The structure of the bidirectional hierarchical fusion module.
  • Figure 4: Visualization results on RRSIS-D. Compared to RMSIN [liu2024rotated], RS2-SAM2 demonstrates superior capability in handling local details and boundary regions.