Adapting Segment Anything Model for Change Detection in HR Remote Sensing Images

Lei Ding, Kun Zhu, Daifeng Peng, Hao Tang, Kuiwu Yang, Lorenzo Bruzzone

TL;DR

This work targets change detection (CD) in high-resolution remote sensing images (HR RSIs) by leveraging Vision Foundation Models (VFMs). It introduces SAM-CD, which uses FastSAM as a visual encoder together with a trainable adaptor and a task-agnostic semantic learning branch that captures semantic latent information and separates semantic changes from temporal variations. The method uses multi-scale feature fusion and a temporal constraint loss to align latent representations across dates. It achieves superior accuracy compared to the SOTA methods on four benchmark datasets and shows sample-efficient learning that is competitive with semi-supervised approaches. The findings suggest that incorporating semantic priors from VFMs can significantly enhance HR RSI CD and, with further refinements, pave the way toward zero-shot or few-shot CD.

Abstract

Vision Foundation Models (VFMs) such as the Segment Anything Model (SAM) allow zero-shot or interactive segmentation of visual content, and they have therefore been quickly adopted across a variety of visual scenes. However, their direct use in many Remote Sensing (RS) applications is often unsatisfactory due to the special imaging characteristics of RS images. In this work, we aim to utilize the strong visual recognition capabilities of VFMs to improve change detection (CD) in high-resolution (HR) Remote Sensing Images (RSIs). We employ the visual encoder of FastSAM, an efficient variant of SAM, to extract visual representations in RS scenes. To adapt FastSAM to focus on specific ground objects in RS scenes, we propose a convolutional adaptor that aggregates the task-oriented change information. Moreover, to utilize the semantic representations that are inherent to SAM features, we introduce a task-agnostic semantic learning branch to model the semantic latent in bi-temporal RSIs. The resulting method, SAM-CD, obtains superior accuracy compared to the SOTA methods and exhibits a sample-efficient learning ability that is comparable to semi-supervised CD methods. To the best of our knowledge, this is the first work that adapts VFMs for the CD of HR RSIs.
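
The TL;DR mentions a temporal constraint loss that aligns latent representations across the two acquisition dates, but this page does not give its formula. The sketch below shows one plausible form, assuming the loss is one minus the cosine similarity between bi-temporal latent vectors, averaged over pixels labelled as unchanged. The function names (`cosine_similarity`, `temporal_constraint_loss`), the list-based data layout, and the masking rule are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    eps = 1e-8  # guards against division by zero for all-zero vectors
    return dot / (max(norm_u, eps) * max(norm_v, eps))


def temporal_constraint_loss(latent_t1, latent_t2, change_mask):
    """Mean (1 - cosine similarity) over pixels labelled unchanged.

    Assumed form only; the paper's exact loss may differ.
    latent_t1, latent_t2: per-pixel feature vectors, one list per date.
    change_mask: 0/1 labels per pixel (1 = changed, 0 = unchanged).
    """
    total, count = 0.0, 0
    for f1, f2, changed in zip(latent_t1, latent_t2, change_mask):
        if changed:
            continue  # changed pixels are not pulled together
        total += 1.0 - cosine_similarity(f1, f2)
        count += 1
    return total / count if count else 0.0


# Toy example: three "pixels" with 2-D latent vectors.
latent_t1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latent_t2 = [[1.0, 0.0], [1.0, 0.0], [-1.0, -1.0]]
change_mask = [0, 0, 1]
loss = temporal_constraint_loss(latent_t1, latent_t2, change_mask)
```

In this toy case the first unchanged pixel is perfectly aligned and contributes 0, the second is orthogonal and contributes 1, and the changed pixel is ignored, so the loss is their mean. A real implementation would operate on batched feature maps in a deep learning framework rather than on Python lists.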

Paper Structure

This paper contains 17 sections, 10 equations, 7 figures, and 10 tables.

Figures (7)

  • Figure 1: Architecture of the proposed SAM-CD.
  • Figure 2: The proposed adaptor network to utilize FastSAM features. Each $\textcircled{\textit{f}}$ denotes a convolutional fusion operation.
  • Figure 3: CD results of the different methods in the ablation study. The predicted maps are compared with the GT maps. The differences are highlighted in color.
  • Figure 4: Visualization of the semantic latent. Warm colors indicate high values; cold colors indicate low values.
  • Figure 5: CD results of the different fully supervised methods: (a)(b) on the LEVIR-CD dataset, (c)(d) on the WHU-CD dataset, (e)(f) on the CLCD dataset, (g)(h) on the S2Looking dataset.
  • ...and 2 more figures