MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data

Antoine Labatie, Michael Vaccaro, Nina Lardiere, Anatol Garioud, Nicolas Gonthier

TL;DR

MAESTRO advances self-supervised learning for Earth Observation by tailoring Masked Autoencoders to multimodal, multitemporal, and multispectral data. It systematically benchmarks fusion strategies and introduces a patch-group-wise spectral normalization that injects a spectral prior without increasing token counts, achieving state-of-the-art results on multitemporal tasks and strong performance elsewhere. The approach emphasizes early fusion across time steps for similar modalities and joint-token multispectral fusion with spectral priors, while maintaining computational efficiency. Extensive ablations across four EO datasets, plus cross-dataset transfer experiments, demonstrate the benefits of exploiting temporal dynamics and spectral structure in SSL for EO. The work provides reproducible experimental details and code to facilitate adoption in EO research and applications.

Abstract

Self-supervised learning holds great promise for remote sensing, but standard self-supervised methods must be adapted to the unique characteristics of Earth observation data. We take a step in this direction by conducting a comprehensive benchmark of fusion strategies and normalization schemes of reconstruction targets for multimodal, multitemporal, and multispectral Earth observation data. Based on our findings, we introduce MAESTRO, a novel adaptation of the Masked Autoencoder with optimized fusion mechanisms and a normalization scheme that incorporates a spectral prior as a self-supervisory signal. Evaluated on four Earth observation datasets in both intra- and cross-dataset settings, MAESTRO achieves state-of-the-art performance on tasks that strongly rely on multitemporal dynamics, while also remaining competitive on others. Code to reproduce all our experiments is available at https://github.com/ignf/maestro.
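
As an illustration of how a target normalization can carry a spectral prior, the sketch below normalizes the pixels of one patch using statistics pooled over a group of bands, and contrasts this with per-band statistics. It is a minimal, framework-free sketch based on one reading of the description above, not the authors' implementation; the function name `normalize_targets`, the band names, the grouping, and the `eps` value are all assumptions made for the example.

```python
from statistics import fmean, pstdev


def normalize_targets(patch, band_groups, eps=1e-6):
    """Normalize one patch's reconstruction targets group by group.

    patch: dict mapping band name -> list of pixel values in this patch.
    band_groups: list of lists of band names; mean and std are pooled over
        every pixel of every band in a group (hypothetical grouping).
    """
    normalized = {}
    for group in band_groups:
        pooled = [value for band in group for value in patch[band]]
        mean, std = fmean(pooled), pstdev(pooled)
        for band in group:
            normalized[band] = [(value - mean) / (std + eps) for value in patch[band]]
    return normalized


# Hypothetical 2-pixel patch with two correlated bands at different levels.
patch = {"band_a": [1.0, 2.0], "band_b": [3.0, 4.0]}

# Pooled statistics: the offset between the two bands survives in the target.
grouped = normalize_targets(patch, [["band_a", "band_b"]])

# Per-band statistics (each band is its own group): the offset is erased.
per_band = normalize_targets(patch, [["band_a"], ["band_b"]])
```

With pooled statistics, the relative level of each band inside the group remains part of what the model must reconstruct, which is one way to read "spectral prior"; with per-band statistics, both bands yield the same normalized target. In both cases the bands stay within a single patch, so the number of tokens is unchanged.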

Paper Structure

This paper contains 79 sections, 9 equations, 8 figures, 27 tables.

Figures (8)

  • Figure 1: Overview of MAESTRO. MAESTRO extends the Masked Autoencoder to orchestrate the complex interplay of multimodal, multitemporal, and multispectral Earth Observation data. It employs token-based early fusion across time steps and similar modalities, and token-based late fusion across dissimilar modalities. For multispectrality, it uses joint-token fusion combined with a novel normalization of reconstruction targets, applied patch-group-wise within groups of highly correlated bands, to inject a useful spectral prior during pre-training. Best viewed in color.
  • Figure 2: Token-based fusion modes for handling multimodality and multitemporality. Modes shared and monotemp involve late fusion across modalities and time steps, with parameters either shared across modalities (shared) or kept independent for each modality (monotemp). Mode group involves late fusion across predefined groups of modalities but early fusion across time steps and within each group. Mode inter-group extends group by replacing the final encoder blocks with fusion blocks that enable cross-group token interactions. Mode mod is a special case of group with late fusion across all modalities, but early fusion across time steps.
  • Figure 3: Comparison of different multimodal and multitemporal fusion modes for intra-dataset MAE-B models, ViT-B models, and baseline FMs. We report the weighted F1 score (%) on TreeSatAI-TS and the mIoU (%) on PASTIS-HD and FLAIR-HUB 20%. Results on FLAIR-HUB 20% with SatMAE and Prithvi-EO-2.0 are markedly low and therefore omitted. Refer to SM \ref{tab:fusion_modes} for exact numbers and additional results with CROMA.
  • Figure 4: Comparison of different choices of multispectral fusion and target normalization for intra-dataset MAE models. We report the weighted F1 score (%) on TreeSatAI-TS and the mIoU (%) on PASTIS-HD, with results shown for varying pre-training dataset fractions (top panels) and varying computational costs (bottom panels). Computational costs are measured as pre-training GFLOPs per forward pass (single batch element) for three model sizes: Small, Base, and Large. Refer to SM \ref{tab:multispectral_fraction}, \ref{tab:multispectral_size}, and \ref{tab:macs_flops_pretraining} for exact numbers.
  • Figure 5: Scaling of intra-dataset MAE-B and ViT-B models across pre-training/fine-tuning dataset fractions. We report the weighted F1 score (%) on TreeSatAI-TS and the mIoU (%) on PASTIS-HD and FLAIR-HUB for three fine-tuning dataset fractions: 5%, 20%, and 100%. For each fine-tuning fraction, we compare three pre-training settings: no pre-training, pre-training on the same fraction as fine-tuning, and pre-training on 100% of the data. Refer to SM \ref{tab:mae_pretrain_finetune} for exact numbers.
  • ...and 3 more figures
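
The fusion modes in the Figure 2 caption can be read as different ways of partitioning (modality, time step) tokens into sequences that an encoder processes jointly: tokens in the same sequence are fused early, while separate sequences are fused late. The sketch below encodes that reading only. The function `encoder_sequences`, the modality names, and the group definitions are hypothetical, and the cross-group fusion blocks of mode inter-group are not modeled.

```python
from collections import defaultdict


def encoder_sequences(tokens, mode, groups=None):
    """Partition (modality, timestep) tokens into jointly encoded sequences.

    Returns a dict mapping (encoder_id, sequence_id) -> list of tokens.
    Tokens sharing a key are fused early; distinct keys are fused late.
    """
    group_of = {}
    if mode in ("group", "inter-group"):
        group_of = {m: g for g, members in groups.items() for m in members}

    sequences = defaultdict(list)
    for modality, t in tokens:
        if mode == "shared":
            # One sequence per modality and time step, one shared encoder.
            key = ("shared", (modality, t))
        elif mode == "monotemp":
            # One sequence per modality and time step, one encoder per modality.
            key = (modality, (modality, t))
        elif mode == "mod":
            # One sequence per modality, spanning all of its time steps.
            key = (modality, modality)
        elif mode in ("group", "inter-group"):
            # One sequence per group, spanning its modalities and time steps.
            # inter-group would add cross-group blocks on top (not modeled).
            key = (group_of[modality], group_of[modality])
        else:
            raise ValueError(f"unknown fusion mode: {mode}")
        sequences[key].append((modality, t))
    return dict(sequences)


# Hypothetical setup: two satellite time series and one single-date aerial image.
tokens = [("sar", 0), ("sar", 1), ("optical", 0), ("optical", 1), ("aerial", 0)]
groups = {"satellite": ["sar", "optical"], "aerial": ["aerial"]}

by_mode = {
    mode: encoder_sequences(tokens, mode, groups)
    for mode in ("shared", "monotemp", "mod", "group")
}
```

In this toy setup, shared and monotemp each produce five single-token sequences and differ only in how many encoders they use, mod produces one sequence per modality, and group produces two sequences, one of which mixes both satellite modalities across time.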