MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning
Vishal Nedungadi, Ankit Kariryaa, Stefan Oehmcke, Serge Belongie, Christian Igel, Nico Lang
TL;DR
This work addresses the scarcity of labeled data for Earth Observation (EO) by introducing MMEarth, a global multi-modal pretraining dataset covering 1.2 million locations and 12 modalities. It proposes MP-MAE, a fully convolutional masked autoencoder based on ConvNeXt V2 that learns general-purpose representations by jointly solving multiple pretext tasks across pixel-level and image-level modalities. The results show that multi-modal pretext tasks and domain-specific pretraining improve both fine-tuning and linear probing performance, with notable label-efficiency gains in few-shot settings. The approach supports scalable, data-efficient EO representation learning, with practical benefits for global-scale tasks such as land cover, crop type, and climate-zone classification, while leaving room for further refinement.
Abstract
The volume of unlabelled Earth observation (EO) data is huge, but many important applications lack labelled training data. However, EO data offers the unique opportunity to pair data from different modalities and sensors automatically based on geographic location and time, at virtually no human labor cost. We seize this opportunity to create MMEarth, a diverse multi-modal pretraining dataset at global scale. Using this new corpus of 1.2 million locations, we propose a Multi-Pretext Masked Autoencoder (MP-MAE) approach to learn general-purpose representations for optical satellite images. Our approach builds on the ConvNeXt V2 architecture, a fully convolutional masked autoencoder (MAE). Drawing upon a suite of multi-modal pretext tasks, we demonstrate that our MP-MAE approach outperforms both MAEs pretrained on ImageNet and MAEs pretrained on domain-specific satellite images. This is shown on several downstream tasks, including image classification and semantic segmentation. We find that pretraining with multi-modal pretext tasks notably improves linear probing performance compared to pretraining on optical satellite images only. It also leads to better label efficiency and parameter efficiency, which are crucial in global-scale applications.
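To make the idea of jointly solving several pretext tasks concrete, the following is a minimal sketch of a combined multi-pretext objective. It is not the authors' implementation. It assumes that pixel-level targets are scored with mean squared error on masked positions only, that image-level categorical targets are scored with softmax cross-entropy, and that the per-task losses are combined by a weighted sum with equal weights by default. The function names and the task names `optical` and `image_level_class` are hypothetical placeholders.

```python
import math


def masked_mse(pred, target, mask):
    """Mean squared error over masked positions only (mask value 1 = hidden from the encoder)."""
    errors = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    if not errors:
        raise ValueError("mask selects no positions")
    return sum(errors) / len(errors)


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample with an integer class label."""
    top = max(logits)  # subtract the maximum for numerical stability
    log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_norm - logits[label]


def multi_pretext_loss(pixel_tasks, image_tasks, mask, weights=None):
    """Weighted sum of per-task losses (assumed combination rule, equal weights by default).

    pixel_tasks: {name: (pred, target)} with flattened per-pixel values
    image_tasks: {name: (logits, label)} with one prediction per image
    """
    losses = {name: masked_mse(p, t, mask) for name, (p, t) in pixel_tasks.items()}
    losses.update({name: cross_entropy(lg, y) for name, (lg, y) in image_tasks.items()})
    if weights is None:
        weights = {name: 1.0 for name in losses}
    total = sum(weights[name] * value for name, value in losses.items())
    return total, losses


# Toy example: four "pixels", two of them masked, plus one two-class image-level task.
mask = [1, 0, 1, 0]
pixel_tasks = {"optical": ([0.5, 0.0, 1.0, 0.0], [1.0, 9.0, 1.0, 9.0])}
image_tasks = {"image_level_class": ([0.0, 0.0], 0)}
total, losses = multi_pretext_loss(pixel_tasks, image_tasks, mask)
print(total, losses)
```

In this toy example the unmasked pixels do not contribute to the reconstruction loss even though their predictions are far from the targets, which mirrors the masked-autoencoder setup of reconstructing only what the encoder did not see. How the actual method weights its tasks is not stated in the abstract, so the equal weighting here is purely illustrative.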
