MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning

Vishal Nedungadi, Ankit Kariryaa, Stefan Oehmcke, Serge Belongie, Christian Igel, Nico Lang

TL;DR

This work addresses the scarcity of labeled data for Earth Observation (EO) by introducing MMEarth, a global multi-modal pretraining dataset covering 1.2 million locations and 12 modalities. It proposes MP-MAE, a fully convolutional masked autoencoder based on ConvNeXt V2 that learns general-purpose representations by jointly solving multiple pretext tasks across pixel-level and image-level modalities. The results show that multi-modal pretext tasks and domain-specific pretraining improve both fine-tuning and linear probing performance, with notable gains in label efficiency in few-shot settings. The approach supports scalable, data-efficient EO representation learning for global-scale tasks such as land cover, crop type, and climate-zone classification.

Abstract

The volume of unlabelled Earth observation (EO) data is huge, but many important applications lack labelled training data. However, EO data offers the unique opportunity to pair data from different modalities and sensors automatically based on geographic location and time, at virtually no human labor cost. We seize this opportunity to create MMEarth, a diverse multi-modal pretraining dataset at global scale. Using this new corpus of 1.2 million locations, we propose a Multi-Pretext Masked Autoencoder (MP-MAE) approach to learn general-purpose representations for optical satellite images. Our approach builds on the ConvNeXt V2 architecture, a fully convolutional masked autoencoder (MAE). Drawing upon a suite of multi-modal pretext tasks, we demonstrate that our MP-MAE approach outperforms both MAEs pretrained on ImageNet and MAEs pretrained on domain-specific satellite images. This is shown on several downstream tasks including image classification and semantic segmentation. We find that pretraining with multi-modal pretext tasks notably improves the linear probing performance compared to pretraining on optical satellite images only. This also leads to better label efficiency and parameter efficiency which are crucial aspects in global scale applications.
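
To make the idea of jointly solving several pretext tasks concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that pixel-level targets are scored only on masked patches (as in a standard MAE), that image-level targets are scored with a plain mean squared error, and that the per-task losses are combined with equal weights. The function names, task labels, and weighting scheme are all hypothetical and chosen only for illustration.

```python
import random


def random_patch_mask(n_patches, mask_ratio, seed=0):
    """Return a list of booleans; True marks a masked (hidden) patch."""
    rng = random.Random(seed)
    n_masked = int(round(n_patches * mask_ratio))
    masked = set(rng.sample(range(n_patches), n_masked))
    return [i in masked for i in range(n_patches)]


def masked_mse(pred, target, mask):
    """Mean squared error over masked patches only (pixel-level task)."""
    errs = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    if not errs:
        raise ValueError("mask selects no patches")
    return sum(errs) / len(errs)


def mse(pred, target):
    """Plain mean squared error (image-level task: one target per image)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multi_pretext_loss(pixel_tasks, image_tasks, mask, weights=None):
    """Weighted sum of per-task losses.

    pixel_tasks / image_tasks map a task name to a (prediction, target) pair.
    Equal weights are an assumption of this sketch.
    """
    losses = {}
    for name, (pred, target) in pixel_tasks.items():
        losses[name] = masked_mse(pred, target, mask)
    for name, (pred, target) in image_tasks.items():
        losses[name] = mse(pred, target)
    weights = weights or {name: 1.0 for name in losses}
    total = sum(weights[name] * loss for name, loss in losses.items())
    return total, losses


# Toy usage: 8 patches, half of them masked; task names are hypothetical.
mask = random_patch_mask(n_patches=8, mask_ratio=0.5)
pixel_tasks = {"optical": ([0.0] * 8, [1.0] * 8)}
image_tasks = {"image_level": ([0.5, 0.5], [0.5, 1.5])}
total, per_task = multi_pretext_loss(pixel_tasks, image_tasks, mask)
print(total, per_task)
```

In a real model each prediction would come from a separate decoder head on top of a shared encoder, so every task's gradient shapes the same representation; the sketch only shows how the individual losses might be combined.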

Paper Structure (35 sections, 8 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: MMEarth dataset coverage. With a balanced sampling scheme across 14 biomes and 4 years, we collected aligned multi-modal data from 12 modalities using the Google Earth Engine platform (Gorelick et al., 2017) at 1.2M locations.
  • Figure 2: Multi-Pretext Masked Autoencoder (MP-MAE). Our approach extends Masked Autoencoders, which reconstruct only the input image, by incorporating multiple pretext tasks using aligned pixel-level as well as image-level modalities.
  • Figure 3: Label efficiency for few-shot downstream performance. Linear probing performance for varying downstream dataset sizes. MP-MAE ('Atto') pretrained on ImageNet, MMEarth64-S2 (multi-spectral only), MMEarth64 (all multi-modal pretext tasks).
  • Figure A1: Spatial distribution of L1C and L2A data.
  • Figure A3: Distribution of additional modalities.
  • ...and 2 more figures