Rethinking Remote Sensing Change Detection With A Mask View

Xiaowen Ma, Zhenkai Wu, Rongrong Lian, Wei Zhang, Siyang Song

TL;DR

The proposed meta-architecture CDMask adapts to different latent data distributions and can therefore accurately identify changes of interest in complex scenarios. An instance network, CDMaskFormer, customized for the change detection task is also proposed.

Abstract

Remote sensing change detection aims to compare two or more images recorded for the same area but taken at different time stamps to quantitatively and qualitatively assess changes in geographical entities and environmental factors. Mainstream models are usually built on pixel-by-pixel change detection paradigms, which cannot tolerate the diversity of changes caused by complex scenes and variation in imaging conditions. To address this shortcoming, this paper rethinks change detection with the mask view and proposes the corresponding: 1) meta-architecture CDMask and 2) instance network CDMaskFormer. The components of CDMask are a Siamese backbone, a change extractor, a pixel decoder, a transformer decoder, and a normalized detector, which together ensure the proper functioning of the mask detection paradigm. Since the change query can be adaptively updated based on the bi-temporal feature content, the proposed CDMask can adapt to different latent data distributions and thus accurately identify changes of interest in complex scenarios. We further propose the instance network CDMaskFormer, customized for the change detection task, which includes: (i) a change extractor instantiated with spatial-temporal convolutional attention to capture spatio-temporal context simultaneously with lightweight operations; and (ii) a transformer decoder instantiated with scene-guided axial attention to extract more spatial details. CDMaskFormer achieves state-of-the-art performance on five benchmark datasets with a satisfactory efficiency-accuracy trade-off. Code is available at https://github.com/xwmaxwma/rschange.

Paper Structure (19 sections, 15 equations, 11 figures, 8 tables)

This paper contains 19 sections, 15 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Visualization of (a) the architecture comparison of CDPixel and CDMask and (b) performance-computation curves on five benchmark datasets. FLOPs are calculated using an input size of $256 \times 256$. The experiments are carried out five times with different random seeds; the center of each circle indicates the median performance and its radius indicates the standard deviation of the performance. CDMaskFormer achieves state-of-the-art results and the most satisfactory trade-off between change detection performance and computational complexity on all five benchmark datasets.
  • Figure 2: Visualization of the latent feature distribution for changes of interest in different bi-temporal images for BIT, which is an instance of CDPixel. Blue, orange, and green represent features belonging to the changes of interest in image pairs (a), (b), and (c), respectively, and gray indicates unchanged features.
  • Figure 3: Description of the normalized detector. The range of output values differs across input images. (a) shows the statistics of the maximum value of the change channel on the DSIFN-CD dataset. (b) and (c) are example bi-temporal images. (d) and (e) are heat maps of the values before and after normalization, respectively. We introduce min-max normalization to map the values to the range from 0 to 1, so that changes can be detected with a fixed threshold.
  • Figure 4: Architecture of CDMaskFormer. Given the bi-temporal images $\mathcal{T}_1$ and $\mathcal{T}_2$, a pair of weight-shared backbones is applied to obtain the features $\mathcal{F}$. Each layer of bi-temporal features $\mathcal{F}_i^1$ and $\mathcal{F}_i^2$ is passed through a change extractor to obtain the change representation $\mathcal{R}_i$, which is passed through a projection matrix to unify channels and then refined by a pixel decoder. Then, the randomly initialized change queries $\mathcal{R}_q$ and the refined change representations are input to the detail-enhanced decoder for information interaction, so that the change queries are updated based on the bi-temporal feature content. Finally, the obtained change prototypes $\mathcal{R}_p$ and $\mathcal{R}_4$ are input into the normalized detector to obtain the output mask. DWConv and PWConv denote depth-wise convolution and point-wise convolution, respectively, and DMLP denotes the dense multilayer perceptron, which is described in Section 3.2.
  • Figure 5: Structure of the masked attention block (MAB), the scene-guided axial attention block (SAAB), and the scene-guided axial attention module. The SAAB uses scene-guided axial attention to facilitate information interaction between the change queries and high-resolution feature maps in order to mine further details.
  • ...and 6 more figures
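
The min-max normalization step described in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation: the function names `min_max_normalize` and `detect_changes`, the threshold of 0.5, the handling of a constant map, and the toy score maps are all assumptions made for illustration. It only shows why rescaling each output map to the range from 0 to 1 lets a single fixed threshold work when raw output ranges differ across images.

```python
from typing import List

Matrix = List[List[float]]


def min_max_normalize(scores: Matrix, eps: float = 1e-8) -> Matrix:
    """Rescale a 2-D score map to [0, 1] using its own minimum and maximum."""
    flat = [v for row in scores for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span < eps:
        # Constant map: returning all zeros is an assumption of this sketch.
        return [[0.0 for _ in row] for row in scores]
    return [[(v - lo) / span for v in row] for row in scores]


def detect_changes(scores: Matrix, threshold: float = 0.5) -> List[List[int]]:
    """Binary mask: 1 where the normalized score exceeds the fixed threshold."""
    normalized = min_max_normalize(scores)
    return [[1 if v > threshold else 0 for v in row] for row in normalized]


# Two hypothetical score maps with the same pattern but very different ranges.
low_range = [[0.1, 0.2], [0.9, 1.1]]
high_range = [[10.0, 20.0], [90.0, 110.0]]

print(detect_changes(low_range))
print(detect_changes(high_range))
```

Both toy maps yield the same binary mask even though their raw values differ by two orders of magnitude, which is the property the caption attributes to the normalized detector.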