RS2-SAM2: Customized SAM2 for Referring Remote Sensing Image Segmentation
Fu Rong, Meng Lan, Qian Zhang, Lefei Zhang
TL;DR
RS2-SAM2 adapts SAM2 to text-guided segmentation of remote sensing (RS) images with three components: a union encoder that jointly encodes visual and textual inputs, a bidirectional hierarchical fusion module that aligns RS features with textual semantics at multiple scales, and a mask prompt generator that produces dense pseudo-masks to guide SAM2. The approach also introduces a text-guided boundary loss and uses multimodal tokens to produce pixel-level prompts, which helps delineate RS targets precisely. On RefSegRS and RRSIS-D, it reports state-of-the-art results on the Pr, mIoU, and oIoU metrics, supporting the effectiveness of the dense prompts and hierarchical fusion for RS-specific challenges. Overall, RS2-SAM2 is an end-to-end framework for text-conditioned segmentation of complex RS scenes.
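For readers unfamiliar with the three metrics named above, the following is a minimal sketch of how they are commonly computed in referring segmentation work. The definitions used here are assumptions based on common convention, not taken from this text: mIoU is the mean of per-sample IoU, oIoU is total intersection over total union across all samples, and Pr@X is the fraction of samples whose IoU reaches the threshold X. The function names, the thresholds, and the use of `>=` are illustrative choices.

```python
def iou_terms(pred, target):
    """Return (intersection, union) pixel counts for two flat 0/1 masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter, union


def evaluate(preds, targets, thresholds=(0.5, 0.7, 0.9)):
    """Compute mIoU, oIoU, and Pr@X over a list of (pred, target) mask pairs."""
    inters, unions, ious = [], [], []
    for pred, target in zip(preds, targets):
        inter, union = iou_terms(pred, target)
        inters.append(inter)
        unions.append(union)
        # Assumption: two empty masks count as a perfect match.
        ious.append(inter / union if union else 1.0)
    n = len(ious)
    total_union = sum(unions)
    return {
        # Mean of per-sample IoU: every sample weighs the same.
        "mIoU": sum(ious) / n,
        # Pooled IoU: large targets dominate the score.
        "oIoU": sum(inters) / total_union if total_union else 1.0,
        # Fraction of samples with IoU at or above each threshold (assumed >=).
        "Pr": {t: sum(1 for i in ious if i >= t) / n for t in thresholds},
    }


# Toy data: a small target that is half right, and a large target that is exact.
preds = [[1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1]]
targets = [[1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1]]
metrics = evaluate(preds, targets)
print(metrics)
```

In the toy data the two scores diverge because oIoU pools pixels and therefore rewards the large, correctly segmented target more than mIoU does.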
Abstract
Referring Remote Sensing Image Segmentation (RRSIS) aims to segment target objects in remote sensing (RS) images based on textual descriptions. Although Segment Anything Model 2 (SAM2) has shown remarkable performance in various segmentation tasks, its application to RRSIS presents several challenges, including understanding the text-described RS scenes and generating effective prompts from text. To address these issues, we propose RS2-SAM2, a novel framework that adapts SAM2 to RRSIS by aligning the adapted RS features and textual features while providing pseudo-mask-based dense prompts. Specifically, we employ a union encoder to jointly encode the visual and textual inputs, generating aligned visual and text embeddings as well as multimodal class tokens. A bidirectional hierarchical fusion module is introduced to adapt SAM2 to RS scenes and align adapted visual features with the visually enhanced text embeddings, improving the model's interpretation of text-described RS scenes. To provide precise target cues for SAM2, we design a mask prompt generator, which takes the visual embeddings and class tokens as input and produces a pseudo-mask as the dense prompt of SAM2. Experimental results on several RRSIS benchmarks demonstrate that RS2-SAM2 achieves state-of-the-art performance.
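To make the idea of a pseudo-mask dense prompt concrete, here is a minimal sketch of one way a mask could be derived from visual embeddings and a class token. It is not the paper's mask prompt generator, whose internals are not described in this abstract. As a stand-in, it scores each pixel by the scaled dot product between its embedding and the class token and passes the result through a sigmoid. The names `pseudo_mask` and `sigmoid`, the scaling factor, and the toy inputs are all hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def pseudo_mask(visual, cls_token):
    """Score each pixel embedding against a class token.

    visual: H x W grid (nested lists) of C-dimensional embeddings.
    cls_token: C-dimensional list.
    Returns an H x W grid of probabilities in (0, 1).

    Assumption: similarity-based scoring stands in for whatever
    learned layers the actual generator uses.
    """
    scale = 1.0 / math.sqrt(len(cls_token))
    mask = []
    for row in visual:
        out_row = []
        for pixel in row:
            logit = scale * sum(v * c for v, c in zip(pixel, cls_token))
            out_row.append(sigmoid(logit))
        mask.append(out_row)
    return mask


# Toy 2 x 2 feature map with 2-dimensional embeddings.
visual = [
    [[4.0, 0.0], [-4.0, 0.0]],
    [[0.0, 3.0], [4.0, 1.0]],
]
cls_token = [1.0, 0.0]
mask = pseudo_mask(visual, cls_token)
for row in mask:
    print([round(p, 3) for p in row])
```

Pixels whose embeddings point in the same direction as the class token receive probabilities near 1, and pixels pointing the opposite way receive probabilities near 0. A soft mask of this kind has the same spatial form as a dense mask prompt.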
