Multi-Modal Building Change Detection for Large-Scale Small Changes: Benchmark and Baseline

Ye Wang, Wei Lu, Zhihui You, Keyan Chen, Tongfei Liu, Kaiyu Li, Hongruixuan Chen, Qingling Shu, Sibao Chen

Abstract

Change detection in optical remote sensing imagery is susceptible to illumination fluctuations, seasonal changes, and variations in surface land-cover materials. Relying solely on RGB imagery often produces pseudo-changes and leads to semantic ambiguity in features. Incorporating near-infrared (NIR) information provides heterogeneous physical cues that are complementary to visible light, thereby enhancing the discriminability of building materials and tiny structures while improving detection accuracy. However, existing multi-modal datasets generally lack high-resolution and accurately registered bi-temporal imagery, and current methods often fail to fully exploit the inherent heterogeneity between these modalities. To address these issues, we introduce the Large-scale Small-change Multi-modal Dataset (LSMD), a bi-temporal RGB-NIR building change detection benchmark dataset targeting small changes in realistic scenarios, providing a rigorous testing platform for evaluating multi-modal change detection methods in complex environments. Based on LSMD, we further propose the Multi-modal Spectral Complementarity Network (MSCNet) to achieve effective cross-modal feature fusion. MSCNet comprises three key components: the Neighborhood Context Enhancement Module (NCEM) to strengthen local spatial details, the Cross-modal Alignment and Interaction Module (CAIM) to enable deep interaction between RGB and NIR features, and the Saliency-aware Multisource Refinement Module (SMRM) to progressively refine fused features. Extensive experiments demonstrate that MSCNet effectively leverages multi-modal information and consistently outperforms existing methods under multiple input configurations, validating its efficacy for fine-grained building change detection. The source code will be made publicly available at: https://github.com/AeroVILab-AHU/LSMD

Paper Structure (21 sections, 18 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: Illustration of data distribution and challenging scenarios in realistic change detection. Note: unchanged images are excluded in (a) and (b) to focus on change-relevant statistics. (a) Comparison of the average change ratio per image between the proposed LSMD and other mainstream benchmarks. (b) Comparison of image proportions across different change ratios (1%–10%) between the proposed LSMD and existing benchmarks. (c) Visual examples of small-scale changes in large-scale scenes. (d) Visual examples of small buildings under vegetation backgrounds.
  • Figure 2: General architectural paradigm for multi-modal change detection (MCD). Synchronous multi-modal imagery is provided at each temporal phase ($t_1$ and $t_2$). The architecture employs dual-branch feature encoding and independent change perception to capture initial temporal difference cues for each modality. Finally, a Multi-modal Complementary Fusion mechanism integrates the multi-source information to generate robust change detection results.
  • Figure 3: Samples and annotations from the LSMD dataset. (Left) Bi-temporal RGB and NIR image pairs with ground truth. (Right) Detailed illustrations of changed regions.
  • Figure 4: Overall architecture of the proposed MSCNet. First, a Siamese backbone is employed to extract bi-temporal RGB and NIR features separately. Next, the Neighborhood Context Enhancement Module (NCEM) enhances the features to capture local change information. The Cross-modal Alignment and Interaction Module (CAIM) then integrates the RGB and NIR difference features to generate a more discriminative fused representation. Subsequently, the Saliency-aware Multisource Refinement Module (SMRM) leverages high-level semantic priors generated offline by RemoteSAM and, under the guidance of semantic masks, multi-modal difference features, and multi-scale context, progressively refines the fused features to restore spatial resolution and maintain cross-scale feature consistency. Finally, the features processed by these modules are used to generate the change detection results.
  • Figure 5: Structure of NCEM. Multi-level neighboring features are selectively aggregated and adaptively weighted to enhance local spatial details and contextual consistency.
  • ...and 7 more figures
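
The statistics described in the Figure 1 caption (average change ratio per image, with unchanged images excluded) can be illustrated with a minimal sketch. The function names, the nested-list mask representation, and the cumulative reading of the 1%–10% thresholds are assumptions made for illustration; they are not taken from the paper or its released code.

```python
from typing import List, Sequence

# Hypothetical representation: a binary label map, 1 = changed, 0 = unchanged.
Mask = Sequence[Sequence[int]]


def change_ratio(mask: Mask) -> float:
    """Fraction of pixels labelled as changed in one ground-truth mask."""
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("mask is empty")
    changed = sum(1 for row in mask for px in row if px)
    return changed / total


def changed_image_ratios(masks: Sequence[Mask]) -> List[float]:
    """Per-image change ratios, dropping images with no changed pixels."""
    return [r for r in map(change_ratio, masks) if r > 0]


def average_change_ratio(masks: Sequence[Mask]) -> float:
    """Mean change ratio over changed images only (cf. Figure 1a)."""
    ratios = changed_image_ratios(masks)
    if not ratios:
        raise ValueError("no changed images")
    return sum(ratios) / len(ratios)


def proportion_at_most(masks: Sequence[Mask], threshold: float) -> float:
    """Share of changed images whose change ratio is <= threshold.

    Assumption: Figure 1b is read here as a cumulative proportion per
    threshold; the paper may bin the ratios differently.
    """
    ratios = changed_image_ratios(masks)
    if not ratios:
        raise ValueError("no changed images")
    return sum(1 for r in ratios if r <= threshold) / len(ratios)


# Toy masks of 4 x 5 = 20 pixels each.
small_change = [[0, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
large_change = [[1, 1, 0, 0, 0], [1, 1, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
no_change = [[0] * 5 for _ in range(4)]
masks = [small_change, large_change, no_change]

avg = average_change_ratio(masks)
share_small = proportion_at_most(masks, 0.10)
```

Excluding unchanged images before averaging keeps a dataset with many all-zero masks from appearing to contain smaller changes than it actually does, which matches the note in the caption.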