Multi-Receptive Field Ensemble with Cross-Entropy Masking for Class Imbalance in Remote Sensing Change Detection

Humza Naveed, Xina Zeng, Mitch Bryson, Nagita Mehrseresht

TL;DR

This paper tackles RSCD under multi-scale changes and severe class imbalance by combining a Siamese FastSAM encoder with a multi-receptive field ensemble for spatio-temporal feature learning. It introduces a decoder ensemble and a multi-scale decoder fusion with attention (MSDFA) to fuse information across scales, followed by a classification head, all trained with a novel cross-entropy masking (CEM) loss that drops easy background pixels during optimization. On four RSCD datasets, the proposed SAM-ECEM method achieves state-of-the-art performance, including a 2.97% improvement in F1 on the challenging S2Looking dataset, and notable gains on Levir-CD, WHU-CD, and CLCD, demonstrating strong generalization. The approach provides practical benefits for remote sensing change detection by effectively balancing locality and global context while addressing data imbalance, with code available at the provided repository.

Abstract

Remote sensing change detection (RSCD) is a complex task in which changes often appear at different scales and orientations. Convolutional neural networks (CNNs) are good at capturing local spatial patterns but cannot model global semantics due to their limited receptive fields. Transformers, in contrast, can model long-range dependencies but are data hungry, and RSCD datasets are not large enough to train them effectively. To tackle this, this paper presents a new architecture for RSCD that adapts the Segment Anything (SAM) vision foundation model and processes features from the SAM encoder through a multi-receptive field ensemble to capture local and global change patterns. We propose an ensemble of spatial-temporal feature enhancement (STFE) modules to capture cross-temporal relations, a decoder to reconstruct change patterns, and a multi-scale decoder fusion with attention (MSDFA) to fuse multi-scale decoder information and highlight key change patterns. Each branch in an ensemble operates on a separate receptive field to capture details from fine to coarse. Additionally, we propose a novel cross-entropy masking (CEM) loss to handle class imbalance in RSCD datasets. Our work outperforms state-of-the-art (SOTA) methods on four change detection datasets: Levir-CD, WHU-CD, CLCD, and S2Looking. We achieve a 2.97% F1-score improvement on the complex S2Looking dataset. The code is available at: https://github.com/humza909/SAM-ECEM
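
The abstract does not spell out how CEM selects the pixels it masks, so the following is only a minimal sketch of the general idea of masking background pixels out of a cross-entropy loss. It assumes that an "easy" background pixel is one whose predicted change probability falls below a cutoff; the function name `cem_loss_sketch` and the parameter `drop_threshold` are hypothetical and do not come from the paper or its repository.

```python
import math


def cem_loss_sketch(probs, labels, drop_threshold=0.1):
    """Binary cross-entropy averaged over the pixels that are not masked out.

    probs          : predicted probability of "change" for each pixel
    labels         : ground truth for each pixel, 1 = change, 0 = background
    drop_threshold : hypothetical cutoff (an assumption, not from the paper);
                     a background pixel predicted below it counts as "easy"

    Returns (mean loss over kept pixels, number of kept pixels).
    """
    eps = 1e-7  # keeps log() away from 0
    total, kept = 0.0, 0
    for p, y in zip(probs, labels):
        if y == 0 and p < drop_threshold:
            continue  # easy background pixel: excluded from the loss
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        kept += 1
    return (total / kept if kept else 0.0), kept


# Four pixels: two easy background, one hard background, one changed pixel.
probs = [0.02, 0.05, 0.6, 0.9]
labels = [0, 0, 0, 1]
masked_loss, masked_kept = cem_loss_sketch(probs, labels)
plain_loss, plain_kept = cem_loss_sketch(probs, labels, drop_threshold=0.0)
```

With `drop_threshold=0.0` nothing is masked and the function reduces to ordinary mean binary cross-entropy. With the default cutoff, the two confident background pixels no longer dilute the average, so the hard background pixel and the changed pixel dominate the loss.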

Paper Structure

This paper contains 17 sections, 9 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: An illustration of dropped pixels in cross-entropy masking to handle class imbalance. The first three images are input images and ground truth, while the last image is an overlay with red pixels representing those dropped during loss calculation.
  • Figure 2: The architectural diagram of SAM-CEM-CD. The red, yellow, and green boxes represent an ensemble in the spatio-temporal feature enhancement (STFE) and decoder stages, where each color operates on a separate receptive field: $1\times 1$, $3\times 3$, or $5\times 5$. The output of the red STFE boxes connects only to the red boxes in the decoder, and likewise for the other colors.
  • Figure 3: Comparison of change detection results. T1 and T2 are the input images at different timestamps, GT is the ground truth, and CF (ChangeFormer), BIT, CGNet, SAM-CD, and SAM-ECEM (ours) are model predictions. Red pixels are false positives (FPs) and blue pixels are false negatives (FNs).
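
The Figure 2 caption describes branches that operate on $1\times 1$, $3\times 3$, and $5\times 5$ receptive fields. The sketch below only illustrates what those sizes mean spatially. It replaces the paper's learned STFE and decoder branches with a fixed averaging filter, which is an assumption made purely for illustration; the names `box_response` and `ensemble_responses` are hypothetical.

```python
def box_response(image, k):
    """Mean over a k x k zero-padded window, a stand-in for a k x k convolution."""
    h, w = len(image), len(image[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        s += image[ii][jj]
            out[i][j] = s / (k * k)
    return out


def ensemble_responses(image, kernel_sizes=(1, 3, 5)):
    """Run one branch per receptive field and return the outputs keyed by size."""
    return {k: box_response(image, k) for k in kernel_sizes}


def count_nonzero(grid):
    return sum(1 for row in grid for v in row if v != 0.0)


# A 5 x 5 "difference map" with a single changed pixel in the center.
image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 1.0
responses = ensemble_responses(image)
footprints = {k: count_nonzero(out) for k, out in responses.items()}
```

In this toy example the single changed pixel affects 1, 9, and 25 output positions for the three kernel sizes. The smallest branch therefore keeps the change sharply localized, while the largest one spreads it over a wider context, which is the fine-to-coarse trade-off that the ensemble is meant to cover.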