Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery

Mubashir Noman, Muzammal Naseer, Hisham Cholakkal, Rao Muhammad Anwar, Salman Khan, Fahad Shahbaz Khan

TL;DR

The proposed SatMAE++ with multi-scale pre-training is equally effective for both optical and multi-spectral imagery, and achieves a mean average precision (mAP) gain of 2.5% on the multi-label classification task on the BigEarthNet dataset.

Abstract

Recent advances in unsupervised learning have demonstrated the ability of large vision models to achieve promising results on downstream tasks by pre-training on large amounts of unlabelled data. Such pre-training techniques have also been explored recently in the remote sensing domain due to the availability of large amounts of unlabelled data. Unlike standard natural image datasets, remote sensing data is acquired from various sensor technologies and exhibits a diverse range of scale variations as well as modalities. Existing satellite image pre-training methods either ignore the scale information present in remote sensing imagery or restrict themselves to a single data modality. In this paper, we revisit transformer pre-training and leverage multi-scale information that is effectively utilized with multiple modalities. Our proposed approach, named SatMAE++, performs multi-scale pre-training and utilizes convolution-based upsampling blocks to reconstruct the image at higher scales, making it extensible to more scales. Compared to existing works, the proposed SatMAE++ with multi-scale pre-training is equally effective for both optical and multi-spectral imagery. Extensive experiments on six datasets reveal the merits of the proposed contributions, leading to state-of-the-art performance on all datasets. SatMAE++ achieves a mean average precision (mAP) gain of 2.5% on the multi-label classification task on the BigEarthNet dataset. Our code and pre-trained models are available at https://github.com/techmn/satmae_pp.

Paper Structure (19 sections, 5 equations, 5 figures, 8 tables)

This paper contains 19 sections, 5 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Illustration of the Masked Autoencoder (MAE) framework for SatMAE++. The input image, with spatial resolution $(4H, 4W)$, is downsampled twice to obtain images of resolution $(2H, 2W)$ and $(H, W)$, respectively. The image at resolution $(H, W)$ is then fed to the MAE, as in the SatMAE framework [cong2023satmae]. The decoder reconstructs the image at resolution $(H, W)$, and an MSE loss measures the reconstruction quality. The reconstructed output is projected back to the feature space and passed through upsampling blocks to obtain features at $(2H, 2W)$ and $(4H, 4W)$ resolutions. The upsampled outputs are projected back to the image space, and an L1 loss penalizes the reconstructions at these higher resolutions. The overall loss is the weighted mean of all the losses.
  • Figure 2: Illustration of the upsample block used in the SatMAE++ framework. Input features $X$ are upsampled with a transposed convolution. A residual block composed of two convolution layers then enhances the upsampled features, yielding $\tilde{X}$.
  • Figure 3: SatMAE++ reconstruction results at multiple scales. Examples are from the fMoW-Sentinel dataset; for illustration, only the RGB channels of the multi-spectral data are shown. The images are reconstructed at resolutions of $(H,W)$, $(2H,2W)$, and $(4H,4W)$, respectively. The proposed model provides better reconstruction results than SatMAE at resolution $(H,W)$.
  • Figure 4: Comparison of the reconstruction performance of our framework with the baseline SatMAE. The reconstruction results of SatMAE on visible patches are worse than on masked patches, whereas our framework provides much better results on all patches, including the visible ones. These results demonstrate the effectiveness of the multi-scale pre-training framework SatMAE++.
  • Figure 5: Illustration of fine-tuning convergence on the validation set of the fMoW-Sentinel dataset. The model pre-trained with multiple scales converges faster than the models pre-trained with a single scale or fewer scales. The model trained with a single scale achieves its highest score of 61.61 at the 20th epoch, whereas the model that used three scales in pre-training converges earlier and achieves its highest score at the 12th epoch.
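
The multi-scale objective described in the Figure 1 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the 2x2 average-pooling downsampler, the equal default loss weights, the single-channel nested-list image type, and the function names (`downsample2x`, `build_targets`, `multiscale_loss`) are all hypothetical. The sketch also omits masking and the learned encoder, decoder, and upsample blocks; it only takes their reconstructions as inputs and combines an MSE term at $(H, W)$ with L1 terms at $(2H, 2W)$ and $(4H, 4W)$ as a weighted mean.

```python
from typing import List, Sequence

Image = List[List[float]]  # single-channel image stored as rows of pixel values


def downsample2x(img: Image) -> Image:
    """Halve height and width by averaging each 2x2 block (assumed scheme)."""
    return [
        [
            (img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
            for c in range(0, len(img[0]), 2)
        ]
        for r in range(0, len(img), 2)
    ]


def build_targets(full_res: Image) -> List[Image]:
    """Downsample a (4H, 4W) input twice; return targets ordered
    (H, W), (2H, 2W), (4H, 4W)."""
    mid = downsample2x(full_res)
    low = downsample2x(mid)
    return [low, mid, full_res]


def mse(pred: Image, target: Image) -> float:
    """Mean squared error over all pixels."""
    diffs = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


def l1(pred: Image, target: Image) -> float:
    """Mean absolute error over all pixels."""
    diffs = [abs(p - t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


def multiscale_loss(
    preds: Sequence[Image],
    targets: Sequence[Image],
    weights: Sequence[float] = (1.0, 1.0, 1.0),  # hypothetical equal weights
) -> float:
    """MSE at the base scale, L1 at the higher scales, combined as a weighted mean."""
    losses = [mse(preds[0], targets[0])]
    losses += [l1(p, t) for p, t in zip(preds[1:], targets[1:])]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)
```

In use, `build_targets` would be called once per input image, and `multiscale_loss` would receive the decoder output at $(H, W)$ followed by the two upsampled reconstructions. Adding a further scale would amount to appending one more target, one more reconstruction, and one more weight.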