ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection

Mubashir Noman, Mustansar Fiaz, Hisham Cholakkal, Salman Khan, Fahad Shahbaz Khan

TL;DR

ELGC-Net addresses semantic change detection in high-resolution remote sensing imagery by integrating local spatial details and global contextual cues. The core contribution is the Efficient Local-Global Context Aggregator (ELGCA), which uses a pooled-transpose attention for global context and a depthwise convolution for local context, arranged in a parallel, channel-split design to reduce parameters. The architecture is Siamese, with fusion modules and a decoder, and a lighter ELGC-Net-LW variant achieves comparable accuracy with far fewer parameters and FLOPs, avoiding pre-trained backbones. Evaluations on LEVIR-CD, DSIFN-CD, and CDD-CD demonstrate state-of-the-art performance and robustness across diverse CD tasks, with clear improvements in IoU, F1, and OA metrics. This work offers a practical, efficient CD framework suitable for high-resolution imagery and resource-constrained environments, with potential real-time deployment on edge devices.

Abstract

Deep learning has shown remarkable success in remote sensing change detection (CD), aiming to identify semantic change regions between co-registered satellite image pairs acquired at distinct time stamps. However, existing convolutional neural network and transformer-based frameworks often struggle to accurately segment semantic change regions. Moreover, transformer-based methods with standard self-attention suffer from quadratic computational complexity with respect to the image resolution, making them less practical for CD tasks with limited training data. To address these issues, we propose an efficient change detection framework, ELGC-Net, which leverages rich contextual information to precisely estimate change regions while reducing the model size. Our ELGC-Net comprises a Siamese encoder, fusion modules, and a decoder. The focus of our design is the introduction of an Efficient Local-Global Context Aggregator module within the encoder, capturing enhanced global context and local spatial information through a novel pooled-transpose (PT) attention and depthwise convolution, respectively. The PT attention employs pooling operations for robust feature extraction and minimizes computational cost with transposed attention. Extensive experiments on three challenging CD datasets demonstrate that ELGC-Net outperforms existing methods. Compared to the recent transformer-based CD approach (ChangeFormer), ELGC-Net achieves a 1.4% gain in the intersection over union (IoU) metric on the LEVIR-CD dataset, while significantly reducing trainable parameters. Our proposed ELGC-Net sets a new state of the art on remote sensing change detection benchmarks. Finally, we also introduce ELGC-Net-LW, a lighter variant with significantly reduced computational complexity, suitable for resource-constrained settings, while achieving comparable performance. Project URL: https://github.com/techmn/elgcnet.
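
The abstract's complexity argument can be illustrated with a rough operation count. The sketch below is not the authors' code: it only counts multiply-accumulate operations (MACs) for the two matrix products of an attention layer, ignoring projections, softmax, and the pooling itself. The function names, the example sizes, and the assumption that pooling reduces the number of Q and K tokens before the channel-by-channel map is built are all illustrative.

```python
def self_attention_macs(num_tokens: int, channels: int) -> int:
    """MACs for standard self-attention: an N x N map (Q K^T), then map times V."""
    build_map = num_tokens * num_tokens * channels
    apply_map = num_tokens * num_tokens * channels
    return build_map + apply_map


def transposed_attention_macs(num_tokens: int, channels: int, pooled_tokens=None) -> int:
    """MACs for transposed attention: a C x C map, then V times that map.

    pooled_tokens is a hypothetical knob: the number of Q/K tokens left
    after pooling. V keeps all num_tokens tokens.
    """
    if pooled_tokens is None:
        pooled_tokens = num_tokens
    build_map = pooled_tokens * channels * channels  # (C x N') times (N' x C)
    apply_map = num_tokens * channels * channels     # (N x C) times (C x C)
    return build_map + apply_map


if __name__ == "__main__":
    channels = 64  # illustrative channel count
    for side in (32, 64, 128):
        n = side * side  # tokens in a side x side feature map
        full = self_attention_macs(n, channels)
        transposed = transposed_attention_macs(n, channels)
        pooled = transposed_attention_macs(n, channels, pooled_tokens=n // 4)
        print(f"{side}x{side}: self={full} transposed={transposed} pooled={pooled}")
```

Under these assumptions the standard cost grows with the square of the token count, while the transposed cost grows linearly, so their ratio is simply the number of tokens divided by the number of channels.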

Paper Structure (21 sections, 3 equations, 7 figures, 6 tables)

This paper contains 21 sections, 3 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Example image pairs depicting various challenges related to remote sensing change detection and the corresponding ground-truth segmentation masks. It is highly challenging to ignore semantically irrelevant changes (shown in red boxes) such as (i) shadows and illumination variations, (ii) movable cars and building roof changes (iii) seasonal variations, while accurately segmenting (iv) subtle and large semantic changes (shown in green box).
  • Figure 2: Accuracy (IoU) vs. model size (params) comparison with existing methods on LEVIR-CD. Our ELGC-Net achieves state-of-the-art performance with $9\times$ fewer model parameters than the existing state-of-the-art TransUNetCD li2022transunetcd.
  • Figure 3: Overall architecture of the proposed CD framework. (a) The complete network architecture is presented, illustrating the pre- and post-change input images to the shared encoder. The encoder blocks extract features at four stages and these features are merged by a simple fusion module comprising linear projection, feature concatenation, and a $1\times1$ convolution. These fused feature maps at each stage are then merged in the decoder, which includes several convolutions and transposed convolution layers to obtain upsampled features. The upsampled feature maps are used in the prediction layer to obtain the final change map. (b) The structure of our Encoder block is shown, featuring the Efficient Local-Global Context Aggregation (ELGCA) module and a convolutional MLP. (c) A detailed view of the proposed ELGCA module, which performs the following key operations: (i) capturing local spatial context using a $3\times3$ depth-wise convolution ($\bar{X}^i_{lo}$), (ii) global context aggregation ($A^i_{att}$) through a pooled-transpose (PT) attention operation, and (iii) multi-channel feature aggregation using a $1\times1$ convolution ($Z^i$). To enhance the efficiency of our PT attention, we perform transposed attention ($G$), which has linear complexity in the number of tokens, on pooled $Q^i$ and $K^i$ tokens (denoted as $\bar{Q}^i$, $\bar{K}^i$) on a subset of channels ($C/4$). The aforementioned operations within our ELGCA module are performed in parallel on different groups (subsets) of channels obtained through channel splitting, leading to improved computational efficiency.
  • Figure 4: The ELGC-Net decoder takes fused features $\hat{X}^{i}_{fused}$ from the four stages (where $i \in \{1,2,3,4\}$) and concatenates them along the channel dimension. Then, it utilizes a $1 \times 1$ convolution to project the features. Afterward, upsampling is performed twice to obtain the same spatial dimension as the model input, using a cascaded transposed convolution and a residual block composed of two convolution layers. Finally, a convolution layer is utilized to obtain the prediction scores having two channels.
  • Figure 5: Qualitative results on the LEVIR-CD dataset. We present a comparison with the five best existing change detection methods in the literature whose codebases are publicly available. The highlighted region shows that our method (ELGC-Net) detects the change regions better than CNN-based methods, including FC-Siam-diff daudt2018_fcsiam, STANet chen2020spatial, and DTCDSCN liu2020building, and transformer-based methods such as BIT chen2021_bit and ChangeFormer changeformer.
  • ...and 2 more figures
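
The Figure 4 caption ends with two-channel prediction scores, and the results are reported in IoU, F1, and OA. A minimal sketch of how such scores could be turned into a binary change map and scored against a ground-truth mask is given below. It uses the standard definitions of these metrics for the change class. The channel order (channel 0 for no change, channel 1 for change), the function names, and the toy values are assumptions for illustration, not taken from the paper or its repository.

```python
from typing import Dict, List

Grid = List[List[float]]
Mask = List[List[int]]


def scores_to_mask(no_change: Grid, change: Grid) -> Mask:
    """Per-pixel argmax over two score channels (assumed order: no change, change)."""
    return [
        [1 if c > n else 0 for n, c in zip(row_n, row_c)]
        for row_n, row_c in zip(no_change, change)
    ]


def change_metrics(pred: Mask, target: Mask) -> Dict[str, float]:
    """IoU and F1 of the change class, plus overall accuracy (OA)."""
    tp = fp = fn = tn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
            else:
                tn += 1
    total = tp + fp + fn + tn
    union = tp + fp + fn
    return {
        "iou": tp / union if union else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if union else 0.0,
        "oa": (tp + tn) / total if total else 0.0,
    }


# Toy 2x2 example with made-up scores.
no_change_scores = [[0.9, 0.2], [0.4, 0.8]]
change_scores = [[0.1, 0.8], [0.6, 0.2]]
predicted = scores_to_mask(no_change_scores, change_scores)
ground_truth = [[0, 1], [0, 0]]
metrics = change_metrics(predicted, ground_truth)
print(predicted, metrics)
```

In this toy case one changed pixel is found correctly and one unchanged pixel is flagged by mistake, so IoU is 0.5 while OA is 0.75. Because unchanged pixels usually dominate a scene, OA tends to look higher than IoU and F1 for the same prediction.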