RSRefSeg: Referring Remote Sensing Image Segmentation with Foundation Models

Keyan Chen, Jiafan Zhang, Chenyang Liu, Zhengxia Zou, Zhenwei Shi

TL;DR

This work tackles fine-grained referring segmentation in remote sensing with RSRefSeg, a foundation-model pipeline that couples CLIP and SAM through an AttnPrompter, which converts coarse text-visual cues into prompts for SAM. Both CLIP and SAM are adapted with low-rank fine-tuning, and the text is decomposed into global and local semantics. On the RRSIS-D dataset, RSRefSeg outperforms prior methods. Ablations examine the design choices, including prompt structure and model adaptation. In practice, RSRefSeg supports segmentation guided by natural language in remote sensing imagery, which is useful for fine-grained scene understanding and object extraction.

Abstract

Referring remote sensing image segmentation is crucial for achieving fine-grained visual understanding through free-format textual input, enabling enhanced scene and object extraction in remote sensing applications. Current research primarily utilizes pre-trained language models to encode textual descriptions and align them with visual modalities, thereby facilitating the expression of relevant visual features. However, these approaches often struggle to establish robust alignments between fine-grained semantic concepts, leading to inconsistent representations across textual and visual information. To address these limitations, we introduce a referring remote sensing image segmentation foundational model, RSRefSeg. RSRefSeg leverages CLIP for visual and textual encoding, employing both global and local textual semantics as filters to generate referring-related visual activation features in the latent space. These activated features then serve as input prompts for SAM, which refines the segmentation masks through its robust visual generalization capabilities. Experimental results on the RRSIS-D dataset demonstrate that RSRefSeg outperforms existing methods, underscoring the effectiveness of foundational models in enhancing multimodal task comprehension. The code is available at https://github.com/KyanChen/RSRefSeg.
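
The idea of using text embeddings as filters over visual features can be illustrated with a small NumPy sketch. This is not the authors' AttnPrompter or their code: random arrays stand in for CLIP features, plain cosine similarity stands in for whatever filtering the paper uses, and every name here (`text_filtered_activations`, `global_text`, `local_text`) is hypothetical. It only shows how one global and several local text vectors could each produce an activation map over a feature grid, which a later stage might turn into prompts.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def text_filtered_activations(visual, global_text, local_text):
    """Cosine-similarity maps between text vectors and a visual feature grid.

    visual:      (H, W, D) grid of visual features (assumed layout)
    global_text: (D,)      sentence-level text embedding (assumed)
    local_text:  (T, D)    token- or phrase-level embeddings (assumed)
    returns:     (1 + T, H, W), map 0 for the global vector, then one per local vector
    """
    v = l2_normalize(visual)
    filters = l2_normalize(np.vstack([global_text[None, :], local_text]))
    # Dot product of every unit-length feature with every unit-length filter.
    return np.einsum("hwd,kd->khw", v, filters)


# Stand-in data: random features instead of real CLIP outputs.
rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 4, 8))
global_text = rng.normal(size=8)
local_text = rng.normal(size=(3, 8))

# Plant one grid cell that points in the same direction as the global text vector.
visual[1, 2] = 5.0 * global_text

maps = text_filtered_activations(visual, global_text, local_text)
peak = np.unravel_index(np.argmax(maps[0]), maps[0].shape)
print(maps.shape, peak)
```

In this toy setup the global map peaks at the planted cell, since that is the only feature aligned with the global text vector. How RSRefSeg actually converts such activations into SAM prompts is described in the paper, not here.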

Paper Structure (13 sections, 4 equations, 1 figure, 2 tables)

Figures (1)

  • Figure 1: The overview of the proposed RSRefSeg. The fire icon symbolizes that the model parameters are tuned.