SatSwinMAE: Efficient Autoencoding for Multiscale Time-series Satellite Imagery

Yohei Nakayama; Jiawei Su; Luis M. Pazos-Outón

TL;DR

SatSwinMAE extends the SwinMAE model to integrate temporal information for satellite time-series data. It shows significant performance improvements over existing state-of-the-art foundation models on all evaluated downstream tasks: land cover segmentation, building density prediction, flood mapping, wildfire scar mapping, and multi-temporal crop segmentation.

Abstract

Recent advancements in foundation models have significantly impacted various fields, including natural language processing, computer vision, and multi-modal tasks. One area that stands to benefit greatly is Earth observation, where these models can efficiently process large-scale, unlabeled geospatial data. In this work, we extend the SwinMAE model to integrate temporal information for satellite time-series data. The architecture employs a hierarchical 3D Masked Autoencoder (MAE) with Video Swin Transformer blocks to capture multi-scale spatio-temporal dependencies in satellite imagery. To enhance transfer learning, we reuse the pretrained weights of both the encoder and the decoder, and add skip connections to preserve scale-specific information. The result is an architecture similar to SwinUNet with an additional temporal component. Our approach shows significant performance improvements over existing state-of-the-art foundation models on all evaluated downstream tasks: land cover segmentation, building density prediction, flood mapping, wildfire scar mapping, and multi-temporal crop segmentation. In particular, on the land cover segmentation task of the PhilEO Bench dataset, it outperforms other geospatial foundation models with 10.4% higher accuracy.
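
To make the idea of masked autoencoding over spatio-temporal patches concrete, the following is a minimal sketch of generic MAE-style random masking on a 3D (time, height, width) patch grid. It is an illustration under stated assumptions, not the authors' implementation: the function names, the patch size, the input size, and the 75% mask ratio are all hypothetical, and the paper may use a different masking scheme (for example, masking at window granularity).

```python
import random


def patch_grid(t, h, w, pt, ph, pw):
    """Return the number of patches along (time, height, width).

    t, h, w: number of time steps and spatial size of the input series.
    pt, ph, pw: hypothetical 3D patch size (not taken from the paper).
    """
    if t % pt or h % ph or w % pw:
        raise ValueError("input dimensions must be divisible by the patch size")
    return t // pt, h // ph, w // pw


def random_patch_mask(grid, mask_ratio, seed=None):
    """Pick the set of (ti, hi, wi) patch indices hidden from the encoder.

    Generic uniform random masking, as in a plain MAE; the mask ratio is
    an assumption chosen for illustration.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    gt, gh, gw = grid
    all_patches = [
        (ti, hi, wi)
        for ti in range(gt)
        for hi in range(gh)
        for wi in range(gw)
    ]
    n_masked = int(len(all_patches) * mask_ratio)
    rng = random.Random(seed)
    return set(rng.sample(all_patches, n_masked))


if __name__ == "__main__":
    # Hypothetical input: 4 time steps of 96x96 pixels, patches of 2x8x8.
    grid = patch_grid(4, 96, 96, 2, 8, 8)
    masked = random_patch_mask(grid, mask_ratio=0.75, seed=0)
    total = grid[0] * grid[1] * grid[2]
    print(f"grid={grid}, masked {len(masked)} of {total} patches")
```

In a setup like this, the encoder would see only the unmasked patches and the decoder would be trained to reconstruct the masked ones; the hierarchical Swin stages and skip connections described in the abstract are outside the scope of this sketch.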

Paper Structure (23 sections, 5 figures, 1 table)

Introduction
Related work
Vision Transformers (ViT) for Image Classification
Transformer-Based 3D Vision Models
Self-supervised learning in the geospatial domain
Swin Transformers
Methodology
Pretraining
Dataset
Model Architecture
Training Settings
Transfer Learning
Experiments and Discussion
Pretraining
Transfer Learning
...and 8 more sections

Figures (5)

Figure 1: Swin model architecture for MAE pretraining.
Figure 2: Swin architecture with UNet-like residual connections for finetuning.
Figure 3: An example of reconstructed satellite imagery during pretraining. Only the RGB bands are shown, for visualization purposes.
Figure 4: PhilEO Bench land cover classification accuracy.
Figure 5: PhilEO Bench building density prediction results.
