GRAD-Former: Gated Robust Attention-based Differential Transformer for Change Detection

Durgesh Ameta, Ujjwal Mishra, Praful Hambarde, Amit Shukla

TL;DR

This work presents GRAD-Former, a novel framework that enhances contextual understanding while maintaining efficiency through reduced model size, and establishes a new benchmark for remote sensing change detection performance.

Abstract

Change detection (CD) in remote sensing aims to identify semantic differences between satellite images captured at different times. While deep learning has significantly advanced this field, existing approaches based on convolutional neural networks (CNNs), transformers, and Selective State Space Models (SSMs) still struggle to precisely delineate change regions. In particular, traditional transformer-based methods suffer from quadratic computational complexity when applied to very high-resolution (VHR) satellite images and often perform poorly with limited training data, leading to under-utilization of the rich spatial information available in VHR imagery. We present GRAD-Former, a novel framework that enhances contextual understanding while maintaining efficiency through reduced model size. The proposed framework consists of a novel encoder with an Adaptive Feature Relevance and Refinement (AFRAR) module, together with fusion and decoder blocks. AFRAR integrates global-local contextual awareness through two proposed components: the Selective Embedding Amplification (SEA) module and the Global-Local Feature Refinement (GLFR) module. SEA and GLFR leverage gating mechanisms and differential attention, respectively; the latter generates multiple softmax attention maps to capture important features while minimizing the capture of irrelevant ones. Multiple experiments across three challenging CD datasets (LEVIR-CD, CDD, DSIFN-CD) demonstrate GRAD-Former's superior performance compared to existing approaches. Notably, GRAD-Former outperforms the current state-of-the-art models across all metrics and all datasets while using fewer parameters. Our framework establishes a new benchmark for remote sensing change detection performance. Our code will be released at: https://github.com/Ujjwal238/GRAD-Former

Paper Structure (25 sections, 18 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Example images showing different challenges in the CD task, along with their ground-truth change masks. It is especially difficult to ignore irrelevant changes (marked in red boxes), such as (a) shadows and lighting differences, (b) seasonal variations, and (c) moving cars and roof changes, while accurately detecting (d) both large and small meaningful changes (marked in the green box).
  • Figure 2: Overall design of the proposed GRAD-Former. (a) The full network structure: the pre- and post-change input images are processed through a shared encoder. This encoder extracts features across four stages, which are then combined using the DA module. The fused feature maps from each stage are merged within the decoder, which employs multiple convolutional and transposed convolution layers to generate upsampled features. These upsampled feature maps are then used in the prediction layer to produce the final change map. (b) Detailed architecture of the decoder: it processes the fused feature maps $\hat{\mathcal{F}}_{fuse}^{i}$ extracted from the four stages ($i = 1, 2, 3, 4$) by concatenating them along the channel dimension and then applies a $1 \times 1$ convolution to project the features. Subsequently, upsampling is performed twice to restore the spatial dimensions to match the model input, using a cascaded transposed convolution followed by a residual block consisting of two convolutional layers. Finally, a convolutional layer generates the prediction scores. (c) The encoder block structure, incorporating the novel Adaptive Feature Relevance and Refinement (AFRAR) module alongside a convolutional MLP layer. Instance normalization is applied across channels before the input is fed into the AFRAR and MLP layers. (d) Detailed architecture of the AFRAR module, which first splits the input ${\mathcal{F}}_{norm}^i$ along the channel dimension and applies the SEA and GLFR modules to extract relevant features; their outputs are concatenated to form $\bar{\mathcal{F}}_{norm}^i$.
  • Figure 3: Detailed illustration of the proposed Difference Amalgamation (DA) module that captures semantic and differential features by concatenating $\hat{\mathcal{F}}_{pre}^i$ and $\hat{\mathcal{F}}_{post}^i$ along the channel dimension, along with their difference. Next, a convolutional layer is applied to reduce the channel count, followed by an activation layer within the DA module, producing $\hat{\mathcal{F}}_{fuse}^i$, where $i \in \{1,2,3,4\}$.
  • Figure 4: (a) Overview of the proposed Global-Local Feature Refinement (GLFR) module, an attention mechanism designed to overcome the diffused focus problem in transformers by implementing differential multi-head attention. The module generates $Q$, $K$, $V$ matrices from the input $\mathcal{F}^i_{\text{GLFR}}$ and splits $Q$ and $K$ into pairs to calculate two softmax attention maps ($A_1$, $A_2$). By computing the difference $A=A_1-\lambda \cdot A_2$ with a learnable scaling factor $\lambda$, the module creates sparse attention patterns that focus on relevant features while filtering out noise. The resulting attention output is combined with local features, providing an efficient balance between global context and local detail without the computational burden typical of transformer architectures. (b) The proposed Selective Embedding Amplification (SEA) module, which selectively amplifies important features through a gating mechanism. The input features $\mathcal{F}^i_{\text{SEA}}$ undergo $L_2$ normalization and multiplication by a learnable parameter $\alpha$. Using another learnable parameter $\gamma$, a normalization factor is computed. The gating function $G(x)=1+\tanh(x+\beta)$, where $\beta$ is another learnable scalar, then adaptively weights each channel based on its importance. The final output $\bar{\mathcal{F}}^i_{\text{SEA}}=\mathcal{F}^i_{\text{SEA}} \cdot G$ enhances relevant features while suppressing noise, making the model robust against the sparsity of information in high-resolution satellite imagery.
  • Figure 5: Qualitative comparisons on the DSIFN-CD dataset. We show the comparison with the eight best existing change detection approaches in the literature whose codebases are publicly available. (a) Pre-change image (A), (b) post-change image (B), (c) FC-EF, (d) FC-Siam-diff, (e) FC-Siam-conc, (f) DTCDSCN, (g) ChangeMamba, (h) BIT, (i) ChangeFormer, (j) ELGC-Net, (k) GRAD-Former (ours), and (l) ground truth. Color scheme: white represents TP (i.e., “changed”), black corresponds to TN (i.e., “unchanged”), red indicates FP, and green denotes FN.
  • ...and 2 more figures
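
The fusion step described in the Figure 3 caption can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the caption does not give the kernel size or the activation, so a $1 \times 1$ convolution (a per-position linear map across channels) and ReLU are assumed here, as is the signed difference pre minus post. The function name `da_fuse` and its arguments are hypothetical.

```python
def da_fuse(pre, post, weights, bias):
    """Fuse pre- and post-change features of one encoder stage.

    pre, post: lists of C channels, each a flat list of N spatial values.
    weights:   C_out rows of 3*C coefficients (assumed 1x1 convolution).
    bias:      C_out bias values.
    """
    # Difference features (signed pre - post is an assumption).
    diff = [[a - b for a, b in zip(cp, cq)] for cp, cq in zip(pre, post)]
    # Channel-wise concatenation of pre, post, and their difference: 3*C channels.
    stacked = pre + post + diff
    n = len(stacked[0])
    fused = []
    for w_row, b in zip(weights, bias):
        channel = []
        for p in range(n):
            # 1x1 convolution: weighted sum over channels at one position.
            val = b + sum(w * stacked[c][p] for c, w in enumerate(w_row))
            # Assumed activation: ReLU.
            channel.append(max(0.0, val))
        fused.append(channel)
    return fused


# One channel, two positions; only the second position changes.
pre = [[1.0, 2.0]]
post = [[1.0, 5.0]]
fused = da_fuse(pre, post, weights=[[0.0, 0.0, -1.0]], bias=[0.0])
```

With these toy weights the output channel reads only the difference features, so the unchanged position maps to zero and the changed one to a positive response.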
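
To make the differential attention in the Figure 4(a) caption concrete, here is a minimal single-head sketch of $A = A_1 - \lambda \cdot A_2$ in plain Python. It covers only the subtraction of two softmax attention maps; how GLFR produces the $Q$/$K$ pairs, learns $\lambda$, and combines the result with local features is not specified in this excerpt and is left out. The helper names and the $1/\sqrt{d}$ score scaling are assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(Q, K):
    """Row-wise softmax(Q K^T / sqrt(d)) for lists of d-dimensional vectors."""
    d = len(Q[0])
    scores = [
        [sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in K]
        for qi in Q
    ]
    return [softmax(r) for r in scores]


def differential_attention(Q1, K1, Q2, K2, V, lam):
    """Return (A, A V) with A = A1 - lam * A2, the difference of two softmax maps."""
    A1 = attention_map(Q1, K1)
    A2 = attention_map(Q2, K2)
    A = [[a1 - lam * a2 for a1, a2 in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    out = [
        [sum(a * v[c] for a, v in zip(row, V)) for c in range(len(V[0]))]
        for row in A
    ]
    return A, out


# Toy example: two tokens with two-dimensional queries, keys, and values.
Q1 = [[1.0, 0.0], [0.0, 1.0]]
K1 = [[1.0, 0.0], [0.0, 1.0]]
Q2 = [[0.5, 0.5], [0.5, 0.5]]
K2 = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
A, out = differential_attention(Q1, K1, Q2, K2, V, lam=0.5)
```

Because each softmax row sums to one, every row of $A$ sums to $1 - \lambda$. Attention mass that both maps assign to the same keys cancels, which is the noise-filtering effect the caption describes.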
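
The SEA gating in the Figure 4(b) caption can be illustrated with the sketch below. The caption names the ingredients ($L_2$ normalization, learnable $\alpha$, $\gamma$, $\beta$, and the gate $G(x)=1+\tanh(x+\beta)$) but not the exact formula. The version here therefore assumes a channel-normalized embedding divided by its root mean square across channels, and uses scalar parameters. The function name `sea_gate` and the `eps` constant are hypothetical.

```python
import math


def sea_gate(features, alpha=1.0, gamma=1.0, beta=0.0, eps=1e-5):
    """Gate each channel by 1 + tanh(gamma * normalized_embedding + beta).

    features: list of channels, each a flat list of spatial values.
    Returns (gates, gated_features).
    """
    # Per-channel embedding: alpha times the L2 norm over spatial positions.
    emb = [alpha * math.sqrt(sum(v * v for v in ch) + eps) for ch in features]
    # Assumed normalization factor: RMS of the embeddings across channels.
    rms = math.sqrt(sum(e * e for e in emb) / len(emb) + eps)
    # Gate values lie in (0, 2): below 1 suppresses a channel, above 1 amplifies it.
    gates = [1.0 + math.tanh(gamma * e / rms + beta) for e in emb]
    gated = [[v * g for v in ch] for ch, g in zip(features, gates)]
    return gates, gated


# Two channels: a high-energy one and a low-energy one.
features = [[3.0, 4.0], [0.3, 0.4]]
gates, gated = sea_gate(features)
```

With positive $\alpha$ and $\gamma$, the higher-energy channel receives the larger gate. Setting $\gamma = 0$ and $\beta = 0$ makes every gate exactly 1, so the module reduces to the identity.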