S2C: Learning Noise-Resistant Differences for Unsupervised Change Detection in Multimodal Remote Sensing Images

Lei Ding, Xibing Zuo, Danfeng Hong, Haitao Guo, Jun Lu, Zhihui Gong, Lorenzo Bruzzone

TL;DR

The paper introduces S2C, a noise-resistant, unsupervised learning framework for change detection in multimodal remote sensing, by fusing Visual Foundation Models with contrastive learning. It proposes two novel CL paradigms, Consistency-regularized Temporal Contrast (CTC) and Consistency-regularized Spatial Contrast (CSC), augmented with a grid sparsity loss and an IoU-based refinement to robustly map semantic changes across temporal and modality gaps. A key contribution is the triplet-based temporal difference modeling for UCD and the grid-level sparsity regularizer that promotes compact change maps. The framework extends naturally to unsupervised Multimodal Change Detection (MMCD) and demonstrates substantial improvements over state-of-the-art methods on four benchmark datasets, with notable sample efficiency and cross-modality applicability.

Abstract

Unsupervised Change Detection (UCD) in multimodal Remote Sensing (RS) images remains a difficult challenge due to the inherent spatio-temporal complexity of the data and the heterogeneity arising from different imaging sensors. Inspired by recent advancements in Visual Foundation Models (VFMs) and Contrastive Learning (CL) methodologies, this research aims to develop CL methodologies that translate the implicit knowledge in VFMs into change representations, thus eliminating the need for explicit supervision. To this end, we introduce a Semantic-to-Change (S2C) learning framework for UCD in both homogeneous and multimodal RS images. Unlike existing CL methodologies, which typically focus on learning multi-temporal similarities, we introduce a novel triplet learning strategy that explicitly models temporal differences, which are crucial to the CD task. Furthermore, random spatial and spectral perturbations are introduced during training to enhance robustness to temporal noise. In addition, a grid sparsity regularization is defined to suppress insignificant changes, and an IoU-matching algorithm is developed to refine the CD results. Experiments on four benchmark CD datasets demonstrate that the proposed S2C learning framework achieves significant improvements in accuracy, surpassing the current state-of-the-art by over 31%, 9%, 23%, and 15%, respectively. It also demonstrates robustness and sample efficiency, and is suitable for training and adapting various VFMs or backbone neural networks. The relevant code will be available at: github.com/DingLei14/S2C.
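The abstract mentions an IoU-matching algorithm for refining CD results but does not specify it here. As a minimal sketch of the underlying measure only, the snippet below computes Intersection over Union between two binary masks. The function name `iou`, the list-of-lists mask format, and the convention of returning 0.0 for two empty masks are assumptions made for illustration, not details from the paper.

```python
def iou(mask_a, mask_b):
    """Intersection over Union of two binary masks given as 2D lists of 0/1.

    Illustrative helper only; the paper's IoU-matching algorithm is not
    described in this excerpt and may use IoU differently.
    """
    flat_a = [bool(v) for row in mask_a for v in row]
    flat_b = [bool(v) for row in mask_b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for a, b in zip(flat_a, flat_b) if a and b)
    union = sum(1 for a, b in zip(flat_a, flat_b) if a or b)
    # Assumed convention: two empty masks have IoU 0.0 (avoids division by zero).
    return intersection / union if union else 0.0


# A predicted change region compared against a candidate region mask.
predicted = [[1, 1],
             [0, 0]]
candidate = [[1, 0],
             [0, 0]]
score = iou(predicted, candidate)  # 1 shared pixel out of 2 in the union
```

A matching step would typically compare such scores against a threshold to decide which regions to keep, but the threshold and the source of the candidate regions are not given in this excerpt.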

Paper Structure

This paper contains 20 sections, 16 equations, 9 figures, 6 tables, 1 algorithm.

Figures (9)

  • Figure 1: The major types of temporal noise in CD on HR RSIs include: (a) spectral/seasonal variations, (b) spatial misalignment, (c) insignificant changes, and (d) multimodal heterogeneity.
  • Figure 2: Overview of the proposed S2C framework for UCD. Triplet losses are calculated with bitemporal images and their augmented copies to learn temporal differences; discriminative losses are calculated between bitemporal images of different regions to learn temporal consistency. Random perturbations are introduced to simulate the spectral and spatial variations.
  • Figure 3: Comparison of CL paradigms in CD. (a) Consistency regularization: $f_\theta$ extracts stable representations across weak/strong perturbations; (b) Spatial contrast: $f_\theta$ distinguishes same/different regions; (c) Proposed Consistency-regularized Temporal Contrast (CTC): $f_\theta$ identifies temporal differences independent of spectral or seasonal variations, and (d) Proposed Consistency-regularized Spatial Contrast (CSC): $f_\theta$ distinguishes same/different regions despite perturbations.
  • Figure 4: Illustration of the application of the proposed S2C for UCD in multimodal RS images. This learning framework applies to not only optical and SAR data, but also other image modalities.
  • Figure 5: $F_1$ (%) obtained by S2C with different weighting parameters.
  • ...and 4 more figures
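
The Figure 2 caption refers to triplet losses computed with bitemporal images and their augmented copies. As a rough illustration, the sketch below implements the generic triplet margin loss, max(0, d(a, p) - d(a, n) + margin), on plain feature vectors. It is the standard textbook form, not the paper's exact formulation. The assignment of roles (anchor = a feature from one date, positive = the feature of its augmented copy, negative = a feature from the other date at a changed location) and the margin value are assumptions made for this example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet margin loss: pull the positive closer to the anchor
    than the negative by at least `margin`. Standard form, used here only
    to illustrate the idea of learning differences via triplets."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical per-pixel features (values invented for illustration).
f_t1 = [1.0, 0.0]      # anchor: feature at date 1
f_t1_aug = [0.9, 0.1]  # positive: same location, perturbed copy of date 1
f_t2 = [0.0, 1.0]      # negative: same location at date 2, assumed changed

loss = triplet_margin_loss(f_t1, f_t1_aug, f_t2, margin=0.5)
# The negative is already much farther away than the positive, so the hinge is inactive.
```

When the perturbed copy stays close to the anchor and the changed feature is far away, the loss is zero; it becomes positive only when a perturbation moves the feature more than the margin allows relative to a real change.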