MultiMAE Meets Earth Observation: Pre-training Multi-modal Multi-task Masked Autoencoders for Earth Observation Tasks

Jose Sosa, Danila Rukhovich, Anis Kacem, Djamila Aouada

TL;DR

The paper adapts a Multi-modal Multi-task Masked Autoencoder (MultiMAE) to Earth Observation (EO) by partitioning Sentinel-2 bands into multiple modalities and adding elevation and segmentation data. A shared Vision Transformer (ViT) encoder with modality-specific decoders learns representations through masked reconstruction across six EO modalities, which allows fine-tuning even when the downstream input modalities differ from those used in pre-training. Pre-training on the MMEarth dataset followed by evaluation on GEO-Bench classification and segmentation tasks shows state-of-the-art transfer performance without requiring larger backbones or fully matched input modalities. The approach adapts to varying input configurations, pointing toward standardized multi-modal pre-training in EO that can support diverse downstream applications.

Abstract

Multi-modal data in Earth Observation (EO) presents a huge opportunity for improving transfer learning capabilities when pre-training deep learning models. Unlike prior work that often overlooks multi-modal EO data, recent methods have started to include it, resulting in more effective pre-training strategies. However, existing approaches commonly face challenges in effectively transferring learning to downstream tasks where the structure of available data differs from that used during pre-training. This paper addresses this limitation by exploring a more flexible multi-modal, multi-task pre-training strategy for EO data. Specifically, we adopt a Multi-modal Multi-task Masked Autoencoder (MultiMAE) that we pre-train by reconstructing diverse input modalities, including spectral, elevation, and segmentation data. The pre-trained model demonstrates robust transfer learning capabilities, outperforming state-of-the-art methods on various EO datasets for classification and segmentation tasks. Our approach exhibits significant flexibility, handling diverse input configurations without requiring modality-specific pre-trained models. Code will be available at: https://github.com/josesosajs/multimae-meets-eo.
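To make the masked multi-modal pre-training idea concrete, the following is a minimal sketch of how visible (unmasked) tokens might be chosen across the six modalities. It assumes the scheme of the original MultiMAE formulation, where the share of visible tokens per modality is drawn from a symmetric Dirichlet distribution; whether this EO adaptation keeps that scheme is not stated here. The function name `sample_visible_tokens`, the 196 tokens per modality, the budget of 98 visible tokens, and `alpha=1.0` are all illustrative assumptions, not values from the paper.

```python
import numpy as np

# Modality names as listed in the paper's figure captions.
MODALITIES = ["RGB", "IRED", "SIRED", "EB", "DEPTH", "SEG"]


def sample_visible_tokens(tokens_per_modality, num_visible, alpha=1.0, rng=None):
    """Pick which patch tokens stay visible in each modality (hypothetical helper).

    tokens_per_modality: patches per modality (assumed equal for all modalities).
    num_visible: total number of visible tokens across all modalities.
    alpha: concentration of the symmetric Dirichlet (assumed value).
    """
    if num_visible > tokens_per_modality:
        raise ValueError("this sketch assumes num_visible <= tokens_per_modality")
    rng = np.random.default_rng() if rng is None else rng

    # Draw the fraction of the visible-token budget given to each modality.
    proportions = rng.dirichlet([alpha] * len(MODALITIES))

    # Convert fractions to integer counts that sum exactly to num_visible.
    raw = proportions * num_visible
    counts = np.floor(raw).astype(int)
    remainder = num_visible - counts.sum()
    largest_fractions = np.argsort(-(raw - counts))
    counts[largest_fractions[:remainder]] += 1

    # Within each modality, choose that many patch indices uniformly at random.
    visible = {}
    for name, count in zip(MODALITIES, counts):
        idx = rng.choice(tokens_per_modality, size=count, replace=False)
        visible[name] = np.sort(idx)
    return visible


if __name__ == "__main__":
    chosen = sample_visible_tokens(196, 98, rng=np.random.default_rng(0))
    for name, idx in chosen.items():
        print(f"{name}: {len(idx)} visible tokens")
```

Only the selected tokens would be passed to the shared encoder; every other patch in every modality becomes a reconstruction target for the corresponding decoder.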

Paper Structure

This paper contains 22 sections, 1 equation, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Pre-training and fine-tuning stages of our MultiMAE adaptation to EO data. During pre-training, MultiMAE relies on multiple input modalities. The model includes a shared ViT-based encoder and as many decoders as input modalities to support multi-tasking. During fine-tuning, the pre-trained encoder is coupled with a task-specific model (depending on the downstream task). Note that during this stage the number of input modalities can differ from the number used in pre-training.
  • Figure 1: Spatial and temporal distribution of the MMEarth dataset. Data from MMEarth spans 4 years and multiple world regions. The multi-modal data was collected and aligned using the Google Earth Engine platform. Figure taken from Nedungadi et al. (2024).
  • Figure 2: MultiMAE pre-training with EO data. Patches are randomly sampled from six EO input modalities: RGB, IRED, SIRED, EB, DEPTH, and SEG (for simplicity, only three are depicted in the figure). The sampled patches are then linearly projected and encoded via a ViT encoder. Finally, task-specific decoders reconstruct the masked patches for all input modalities.
  • Figure 2: MultiMAE pre-training and fine-tuning with EO data. The top part of the figure illustrates the pre-training stage with six input modalities from EO data: RGB, IRED, SIRED, EB, DEPTH, and SEG (for simplicity, only three are depicted in the figure). The bottom part depicts fine-tuning setups. When fine-tuning, task-specific models are coupled with a pre-trained MultiMAE encoder. Fine-tuning occurs under multiple scenarios, e.g. single-modality or multi-modality, by varying the number of input modalities.
  • Figure 3: Decoder design. The tokens from the encoder are first linearly projected to match the decoder dimension. Then, modality-specific and positional embeddings are added. A cross-attention layer incorporates information from the tokens of all modalities; its output is then processed by an MLP and two transformer blocks. Finally, tokens are projected and reshaped to build an image.
  • ...and 3 more figures
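
The decoder data flow described in the Figure 3 caption can be sketched at the level of tensor shapes. This is a simplified NumPy illustration, not the authors' implementation: weights are random stand-ins, attention is single-head without learned query/key/value projections, layer normalization is omitted, and ReLU replaces whatever activation the real MLP uses. The function name `decode_modality`, the 14×14 patch grid, 16×16 patches, decoder width of 64, and the convention that the first rows of `enc_tokens` belong to the decoded modality are all assumptions made for the example.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attention(q, k, v):
    # Single-head scaled dot-product attention (no learned projections).
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v


def decode_modality(enc_tokens, visible_idx, grid=14, patch=16, channels=3,
                    dec_dim=64, rng=None):
    """Shape-level sketch of one modality-specific decoder (hypothetical helper).

    enc_tokens: (num_visible_all_modalities, enc_dim) encoder outputs.
    visible_idx: grid positions of this modality's visible patches; assumed to
                 correspond to the first len(visible_idx) rows of enc_tokens.
    """
    rng = np.random.default_rng(0) if rng is None else rng

    def w(*shape):
        # Random stand-in for a learned weight or embedding.
        return rng.normal(scale=0.02, size=shape)

    n = grid * grid
    # 1. Linearly project encoder tokens to the decoder dimension.
    ctx = enc_tokens @ w(enc_tokens.shape[1], dec_dim)
    # 2. Build one query per patch: a shared mask token everywhere, with this
    #    modality's visible tokens placed at their grid positions.
    queries = np.tile(w(1, dec_dim), (n, 1))
    queries[visible_idx] = ctx[: len(visible_idx)]
    # 3. Add positional and modality-specific embeddings.
    queries = queries + w(n, dec_dim) + w(1, dec_dim)
    # 4. Cross-attention over the visible tokens of all modalities.
    x = queries + attention(queries, ctx, ctx)
    # 5. MLP after cross-attention.
    x = x + np.maximum(x @ w(dec_dim, 4 * dec_dim), 0) @ w(4 * dec_dim, dec_dim)
    # 6. Two transformer blocks (self-attention followed by an MLP).
    for _ in range(2):
        x = x + attention(x, x, x)
        x = x + np.maximum(x @ w(dec_dim, 4 * dec_dim), 0) @ w(4 * dec_dim, dec_dim)
    # 7. Project each token to patch pixels and reshape into an image.
    pixels = x @ w(dec_dim, patch * patch * channels)
    img = pixels.reshape(grid, grid, patch, patch, channels)
    return img.transpose(0, 2, 1, 3, 4).reshape(grid * patch, grid * patch, channels)


if __name__ == "__main__":
    enc = np.random.default_rng(1).normal(size=(30, 32))
    print(decode_modality(enc, np.arange(10)).shape)
```

In a multi-task setup, one such decoder would exist per modality, each with its own output channel count (for example, a single channel for elevation), while all of them read from the same encoder output.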