MultiMAE Meets Earth Observation: Pre-training Multi-modal Multi-task Masked Autoencoders for Earth Observation Tasks
Jose Sosa, Danila Rukhovich, Anis Kacem, Djamila Aouada
TL;DR
The paper adapts the Multi-modal Multi-task Masked Autoencoder (MultiMAE) to Earth Observation (EO) by partitioning Sentinel-2 bands into several modalities and adding elevation and segmentation data. A shared Vision Transformer encoder with modality-specific decoders learns representations through masked reconstruction across six EO modalities, which allows flexible fine-tuning when the downstream inputs do not match those used in pre-training. Pre-training on the MMEarth dataset, followed by evaluation on GEO-Bench classification and segmentation tasks, shows state-of-the-art transfer performance without requiring larger backbones or fully matched input modalities. The approach transfers well and adapts to varying input configurations, which points to the potential for standardizing multi-modal pre-training in EO and supporting diverse downstream applications.
Abstract
Multi-modal data in Earth Observation (EO) presents a huge opportunity for improving transfer learning capabilities when pre-training deep learning models. Unlike prior work that often overlooks multi-modal EO data, recent methods have started to include it, resulting in more effective pre-training strategies. However, existing approaches commonly face challenges in effectively transferring learning to downstream tasks where the structure of available data differs from that used during pre-training. This paper addresses this limitation by exploring a more flexible multi-modal, multi-task pre-training strategy for EO data. Specifically, we adopt a Multi-modal Multi-task Masked Autoencoder (MultiMAE) that we pre-train by reconstructing diverse input modalities, including spectral, elevation, and segmentation data. The pre-trained model demonstrates robust transfer learning capabilities, outperforming state-of-the-art methods on various EO datasets for classification and segmentation tasks. Our approach exhibits significant flexibility, handling diverse input configurations without requiring modality-specific pre-trained models. Code will be available at: https://github.com/josesosajs/multimae-meets-eo.
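
The multi-modal masking step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a fixed budget of visible patch tokens is split across modalities with a Dirichlet draw, as in the original MultiMAE, and that the remaining tokens are the reconstruction targets. The modality names, the 196 tokens per modality, the budget of 98 visible tokens, and `alpha=1.0` are all hypothetical placeholders rather than values from the paper.

```python
import random


def sample_visible_tokens(tokens_per_modality, num_visible, alpha=1.0, rng=None):
    """Pick which patch tokens stay visible across all modalities.

    tokens_per_modality: dict mapping modality name -> number of patch tokens.
    num_visible: total number of tokens the shared encoder is allowed to see.
    alpha: Dirichlet concentration (assumed value, not taken from the paper).
    Returns a dict mapping modality name -> sorted list of visible token indices.
    """
    rng = rng or random.Random()
    names = list(tokens_per_modality)
    total = sum(tokens_per_modality.values())
    if not 0 <= num_visible <= total:
        raise ValueError("num_visible must be between 0 and the total token count")

    # Dirichlet sample via normalized Gamma draws: each modality's share of the budget.
    draws = [rng.gammavariate(alpha, 1.0) for _ in names]
    norm = sum(draws)
    shares = [d / norm for d in draws]

    # Initial integer allocation, capped by the tokens each modality actually has.
    counts = {
        name: min(int(share * num_visible), tokens_per_modality[name])
        for name, share in zip(names, shares)
    }

    # Hand out the rounding remainder one token at a time where capacity is left.
    remaining = num_visible - sum(counts.values())
    while remaining > 0:
        candidates = [n for n in names if counts[n] < tokens_per_modality[n]]
        counts[rng.choice(candidates)] += 1
        remaining -= 1

    # Choose the concrete visible indices; every other index is a masked target.
    return {
        name: sorted(rng.sample(range(tokens_per_modality[name]), counts[name]))
        for name in names
    }


# Hypothetical six-modality setup: four Sentinel-2 band groups plus two extra modalities.
modalities = {
    "s2_group_1": 196,
    "s2_group_2": 196,
    "s2_group_3": 196,
    "s2_group_4": 196,
    "elevation": 196,
    "segmentation": 196,
}
visible = sample_visible_tokens(modalities, num_visible=98, rng=random.Random(0))

# Fine-tuning with a single available modality: pass only that modality, nothing masked.
finetune_visible = sample_visible_tokens({"s2_group_1": 196}, num_visible=196)
```

Because the function only iterates over the modalities it is given, the same routine covers pre-training with all six inputs and fine-tuning with any subset, which mirrors the flexibility the abstract claims for mismatched downstream data.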
