SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery
Yezhen Cong, Samar Khanna, Chenlin Meng, Patrick Liu, Erik Rozi, Yutong He, Marshall Burke, David B. Lobell, Stefano Ermon
TL;DR
SatMAE extends masked autoencoding to satellite imagery by introducing temporal and spectral encodings and novel masking strategies, enabling effective self-supervised pre-training on temporally irregular and multi-spectral data. By encoding timestamps and spectral group information, and by masking patches either consistently or independently across time or spectral groups, SatMAE learns representations that transfer strongly to land cover classification, semantic segmentation, and other remote-sensing tasks. Across the fMoW RGB, fMoW RGB-temporal, and fMoW Sentinel multi-spectral datasets, SatMAE yields substantial improvements over prior self-supervised learning (SSL) methods and supervised baselines, demonstrating strong potential for scalable, label-efficient remote-sensing analysis. The approach shows particular promise for leveraging large unlabeled remote-sensing corpora to improve downstream tasks with societal impact, such as poverty mapping and infrastructure assessment, while acknowledging the need for efficient architectures and careful handling of geographic biases.
Abstract
Unsupervised pre-training methods for large vision models have been shown to enhance performance on downstream supervised tasks. Developing similar techniques for satellite imagery presents significant opportunities, as unlabelled data is plentiful and the inherent temporal and multi-spectral structure provides avenues to further improve existing pre-training strategies. In this paper, we present SatMAE, a pre-training framework for temporal or multi-spectral satellite imagery based on Masked Autoencoder (MAE). To leverage temporal information, we include a temporal embedding and mask image patches independently across time. In addition, we demonstrate that encoding multi-spectral data as groups of bands with distinct spectral positional encodings is beneficial. Our approach yields strong improvements over previous state-of-the-art techniques, both in terms of supervised learning performance on benchmark datasets (up to $\uparrow$ 7%), and transfer learning performance on downstream remote sensing tasks, including land cover classification (up to $\uparrow$ 14%) and semantic segmentation. Code and data are available on the project website: https://sustainlab-group.github.io/SatMAE/
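
The difference between masking patches independently across time and masking them consistently can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_patch_masks`, its parameters, and the 0.75 mask ratio are hypothetical choices made for illustration, and each image in the temporal sequence is assumed to be split into the same number of patches.

```python
import random


def sample_patch_masks(num_frames, num_patches, mask_ratio=0.75,
                       independent=True, seed=0):
    """Return one boolean mask per frame; True marks a masked (hidden) patch.

    Hypothetical helper for illustration only. The mask_ratio default is an
    assumed value, not one taken from the text above.
    """
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)

    def draw():
        # Pick num_masked distinct patch indices uniformly at random.
        masked = set(rng.sample(range(num_patches), num_masked))
        return [i in masked for i in range(num_patches)]

    if independent:
        # Independent masking: every frame gets its own random mask.
        return [draw() for _ in range(num_frames)]

    # Consistent masking: one mask is drawn and reused for every frame.
    shared = draw()
    return [list(shared) for _ in range(num_frames)]


if __name__ == "__main__":
    for mode in (True, False):
        masks = sample_patch_masks(num_frames=3, num_patches=16,
                                   independent=mode)
        label = "independent" if mode else "consistent"
        for t, mask in enumerate(masks):
            print(label, t, "".join("#" if m else "." for m in mask))
```

With independent masking, a patch location hidden at one timestamp may still be visible at another, so the model can draw on other timestamps when reconstructing it; with consistent masking, the same locations are hidden at every timestamp.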
