RSAM-Seg: A SAM-based Approach with Prior Knowledge Integration for Remote Sensing Image Semantic Segmentation

Jie Zhang, Xubing Yang, Rui Jiang, Wei Shao, Li Zhang

TL;DR

RSAM-Seg adapts the general-purpose SAM to remote sensing by removing manual prompts and injecting domain priors through two modules, Adapter-Scale and Adapter-Feature. These modules feed high-frequency image information and image embedding features into the ViT encoder, generating image-informed prompts $P^i$ without user input while keeping the original mask decoder unchanged. Across cloud, field, building, and road tasks, RSAM-Seg outperforms SAM and U-Net, can identify regions missing from the ground truth of some datasets, and shows promising few-shot performance. These results suggest a practical, annotation-efficient approach to RS image segmentation that may also serve as an auxiliary annotation tool.

Abstract

The development of high-resolution remote sensing satellites has provided great convenience for research work related to remote sensing. Segmentation and extraction of specific targets are essential tasks when facing vast and complex remote sensing images. Recently, the introduction of the Segment Anything Model (SAM) has provided a universal pre-trained model for image segmentation tasks. Because the direct application of SAM to remote sensing image segmentation does not yield satisfactory results, we propose RSAM-Seg, which stands for Remote Sensing SAM with Semantic Segmentation, a tailored modification of SAM for the remote sensing field that eliminates the need for manual intervention to provide prompts. Adapter-Scale, a set of supplementary scaling modules, is added to the multi-head attention blocks of SAM's encoder. Furthermore, Adapter-Feature modules are inserted between the Vision Transformer (ViT) blocks. These modules aim to incorporate high-frequency image information and image embedding features to generate image-informed prompts. Experiments are conducted on four distinct remote sensing scenarios: cloud detection, field monitoring, building detection, and road mapping. The experimental results not only show improvement over the original SAM and U-Net across the cloud, building, field, and road scenarios, but also highlight the capacity of RSAM-Seg to discern areas absent from the ground truth of certain datasets, affirming its potential as an auxiliary annotation method. In addition, the performance in few-shot scenarios is commendable, underscoring its potential in dealing with limited datasets.
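
The abstract mentions "high-frequency image information" without saying how it is obtained. As a rough illustration only, the sketch below extracts a high-frequency component by subtracting a box-blurred (low-pass) copy from the image. This is a generic stand-in and not necessarily the method used in RSAM-Seg; the function names `box_blur` and `high_frequency` and the `radius` parameter are hypothetical.

```python
def box_blur(image, radius=1):
    """Low-pass filter: mean over a (2*radius+1)^2 window.

    Windows are truncated at the image border, so only in-bounds
    pixels contribute to each mean.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += image[yy][xx]
                        count += 1
            out[y][x] = total / count
    return out


def high_frequency(image, radius=1):
    """High-pass residual: the image minus its low-pass (blurred) version.

    Assumption: this is a simple illustrative filter, not the paper's own.
    """
    low = box_blur(image, radius)
    h, w = len(image), len(image[0])
    return [[image[y][x] - low[y][x] for x in range(w)] for y in range(h)]


# Usage: a vertical step edge between columns 1 and 2 of a grayscale image.
edge = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
hf = high_frequency(edge)
```

In this toy input the residual is zero in flat regions and non-zero on either side of the step, which is the kind of edge and texture cue that a high-frequency prior is meant to emphasize.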

Paper Structure

This paper contains 22 sections, 3 equations, 11 figures, and 4 tables.

Figures (11)

  • Figure 1: The structure of RSAM-Seg. Adapter-Feature modules are inserted between the modified ViT blocks, while the mask decoder is kept identical to that of the original SAM.
  • Figure 2: The structure of the modified transformer block and Adapter-Scale in the encoder of RSAM-Seg.
  • Figure 3: The structure of the Adapter-Feature between the ViT blocks in the encoder of RSAM-Seg.
  • Figure 4: Images from the different datasets depict various scenes, including clouds, buildings, fields and roads. Each remote sensing image is shown above its corresponding mask. GT denotes ground truth.
  • Figure 5: Comparison of cloud segmentation results on 38-Cloud dataset with RSAM-Seg, SAM and U-Net.
  • ...and 6 more figures