SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery

Yezhen Cong, Samar Khanna, Chenlin Meng, Patrick Liu, Erik Rozi, Yutong He, Marshall Burke, David B. Lobell, Stefano Ermon

TL;DR

SatMAE extends masked autoencoding to satellite imagery by introducing temporal and spectral encodings together with new masking strategies, enabling effective self-supervised pre-training on temporally irregular and multi-spectral data. By encoding timestamps and spectral group information, and by masking either consistently or independently across time or spectral groups, SatMAE learns representations that transfer well to land cover classification, segmentation, and other remote sensing tasks. On the fMoW RGB, fMoW RGB-temporal, and fMoW Sentinel multi-spectral datasets, SatMAE yields substantial improvements over prior SSL methods and supervised baselines, indicating strong potential for scalable, label-efficient remote sensing analysis. The approach is particularly promising for leveraging large unlabeled remote sensing corpora to improve downstream tasks with societal impact, such as poverty mapping and infrastructure assessment. The authors also acknowledge the need for efficient architectures and careful handling of geographic biases.

Abstract

Unsupervised pre-training methods for large vision models have been shown to enhance performance on downstream supervised tasks. Developing similar techniques for satellite imagery presents significant opportunities, as unlabelled data is plentiful and the inherent temporal and multi-spectral structure provides avenues to further improve existing pre-training strategies. In this paper, we present SatMAE, a pre-training framework for temporal or multi-spectral satellite imagery based on Masked Autoencoder (MAE). To leverage temporal information, we include a temporal embedding along with independently masking image patches across time. In addition, we demonstrate that encoding multi-spectral data as groups of bands with distinct spectral positional encodings is beneficial. Our approach yields strong improvements over previous state-of-the-art techniques, both in terms of supervised learning performance on benchmark datasets (up to $\uparrow$ 7%) and transfer learning performance on downstream remote sensing tasks, including land cover classification (up to $\uparrow$ 14%) and semantic segmentation. Code and data are available on the project website: https://sustainlab-group.github.io/SatMAE/
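
As a rough illustration of the two temporal masking strategies named above (consistent and independent), the following minimal sketch builds boolean masks for a sequence of `T` images with `L` patches each. It is not the authors' implementation: the function names, the `True`-means-masked convention, and the choice to draw independent masks over the flattened sequence of `T * L` patches are assumptions made for this example.

```python
import random


def consistent_mask(T, L, p_m, rng):
    """Consistent masking: a patch position is masked in every frame or in none."""
    n_masked = int(L * p_m)
    masked_positions = set(rng.sample(range(L), n_masked))
    row = [pos in masked_positions for pos in range(L)]
    # Every time step shares the same pattern.
    return [list(row) for _ in range(T)]


def independent_mask(T, L, p_m, rng):
    """Independent masking: patches are masked without regard to time step.

    Assumption: the fraction p_m is drawn over the flattened T * L sequence.
    """
    n_masked = int(T * L * p_m)
    masked = set(rng.sample(range(T * L), n_masked))
    return [[(t * L + pos) in masked for pos in range(L)] for t in range(T)]


rng = random.Random(0)        # fixed seed so the sketch is repeatable
T, L, p_m = 3, 16, 0.75       # hypothetical sizes; p_m is the masked fraction
cm = consistent_mask(T, L, p_m, rng)
im = independent_mask(T, L, p_m, rng)

# Both strategies mask the same total number of patches.
print(sum(map(sum, cm)), sum(map(sum, im)))
```

With consistent masking every row of `cm` is identical, so a masked location is never visible at any time step. With independent masking a location hidden in one frame may be visible in another, which lets the model draw on other time steps when reconstructing it.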

Paper Structure (74 sections, 3 equations, 14 figures, 5 tables)

This paper contains 74 sections, 3 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: With carefully designed masking strategies across multi-spectral and temporal images, and temporal and spectral positional encodings, our SatMAE serves as a powerful SSL vision learner for remote sensing tasks.
  • Figure 2: Top: Encoding each temporal patch with a shared patch embedding $f_p$. Bottom: Encoding each spectral patch with a different patch embedding $f_{p_j}$ for each group $j$.
  • Figure 3: Temporal masking: For images in a time series, we can either keep a patch fully visible or fully masked across time (consistent masking), or mask all patches independently (independent masking). In both cases, a fraction $p_m$ of the patches is masked. Here, $T=3$, and the leftmost column orders the temporal sequence according to the timestamp features. For example, "y-12, m-2, h-15" is 12 years from the minimum year (2002), the zero-indexed month 2, and the 15th hour of the day; i.e., roughly 2014, March, 15:00. Spectral masking: The same masking strategies are adapted to groups of the 13 spectral bands in Sentinel-2 images.
  • Figure 4: Top-1 accuracy on fMoW classification. Frozen: linear classification on frozen features of the pre-trained model. Finetune: end-to-end fine-tuning of the whole model. * denotes training from scratch, $\dagger$ denotes initialization from supervised ImageNet weights, and $\ddagger$ denotes initialization from SSL MAE ImageNet weights.
  • Figure 5: Reconstruction quality of SatMAE+IM (left) vs. SatMAE+CM (right). Further results appear in the visualization section (sec:viz).
  • ...and 9 more figures
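
The timestamp features described in the Figure 3 caption (years counted from the minimum year 2002, a zero-indexed month, and the hour of the day) can be sketched as below. This is only an illustration under stated assumptions: the helper names `timestamp_features` and `format_features` are hypothetical, and the day and minute in the example date are made up, since the caption only gives year, month, and hour.

```python
from datetime import datetime

MIN_YEAR = 2002  # minimum year stated in the Figure 3 caption


def timestamp_features(ts):
    """Return (year offset from MIN_YEAR, zero-indexed month, hour of day)."""
    return (ts.year - MIN_YEAR, ts.month - 1, ts.hour)


def format_features(features):
    """Render features in the 'y-.., m-.., h-..' style used in the caption."""
    y, m, h = features
    return f"y-{y}, m-{m}, h-{h}"


# Roughly 2014, March, 15:00 (the day and minute are hypothetical).
example = datetime(2014, 3, 10, 15, 0)
print(format_features(timestamp_features(example)))  # y-12, m-2, h-15

# Ordering a short time series by its timestamp features.
series = [datetime(2016, 7, 1, 9), datetime(2014, 3, 10, 15), datetime(2015, 1, 5, 11)]
ordered = sorted(series, key=timestamp_features)
```

Sorting by the feature tuple orders the images first by year, then by month, then by hour, which is one simple way to obtain the temporal ordering shown in the figure's leftmost column.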