MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data
Antoine Labatie, Michael Vaccaro, Nina Lardiere, Anatol Garioud, Nicolas Gonthier
TL;DR
MAESTRO adapts the Masked Autoencoder to multimodal, multitemporal, and multispectral Earth observation (EO) data for self-supervised learning (SSL). It systematically benchmarks fusion strategies and introduces a patch-group-wise spectral normalization that injects a spectral prior without increasing token counts. The resulting model achieves state-of-the-art results on tasks that depend strongly on multitemporal dynamics and remains competitive on the others. The approach favors early fusion across time steps for similar modalities and joint-token multispectral fusion with spectral priors, while maintaining computational efficiency. Extensive ablations across four EO datasets, plus cross-dataset transfer experiments, demonstrate the benefits of exploiting temporal dynamics and spectral structure in SSL for EO. The work provides experimental details and code so that all experiments can be reproduced.
Abstract
Self-supervised learning holds great promise for remote sensing, but standard self-supervised methods must be adapted to the unique characteristics of Earth observation data. We take a step in this direction by conducting a comprehensive benchmark of fusion strategies and normalization schemes of reconstruction targets for multimodal, multitemporal, and multispectral Earth observation data. Based on our findings, we introduce MAESTRO, a novel adaptation of the Masked Autoencoder with optimized fusion mechanisms and a normalization scheme that incorporates a spectral prior as a self-supervisory signal. Evaluated on four Earth observation datasets in both intra- and cross-dataset settings, MAESTRO achieves state-of-the-art performance on tasks that strongly rely on multitemporal dynamics, while also remaining competitive on others. Code to reproduce all our experiments is available at https://github.com/ignf/maestro.
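
To make the idea of a normalization scheme that carries a spectral prior more concrete, the following is a minimal sketch of one possible reading of patch-group-wise normalization, not the implementation from the MAESTRO repository. It assumes that the reconstruction target of each patch is normalized with a mean and variance pooled over a group of spectral bands, rather than computed per band. The function name `normalize_patch_by_group`, the band names, the grouping, and the `eps` value are all hypothetical and chosen for illustration only.

```python
from math import sqrt


def normalize_patch_by_group(patch, groups, eps=1e-6):
    """Normalize one patch's target with statistics shared inside each band group.

    patch:  dict mapping a band name to the flat list of pixel values of that patch
    groups: list of lists of band names (hypothetical grouping, an assumption)
    eps:    small constant for numerical stability (assumed value)
    """
    out = {}
    for group in groups:
        # Pool the pixels of every band in the group to get shared statistics.
        pooled = [v for band in group for v in patch[band]]
        mean = sum(pooled) / len(pooled)
        var = sum((v - mean) ** 2 for v in pooled) / len(pooled)
        scale = sqrt(var + eps)
        for band in group:
            out[band] = [(v - mean) / scale for v in patch[band]]
    return out


# Toy patch of four pixels: the "nir" band is the "red" band shifted by 0.3.
patch = {
    "red": [0.10, 0.12, 0.11, 0.13],
    "nir": [0.40, 0.42, 0.41, 0.43],
}

# Shared statistics across the group keep the offset between the two bands.
joint = normalize_patch_by_group(patch, [["red", "nir"]])

# Per-band statistics (one band per group) remove that offset entirely.
per_band = normalize_patch_by_group(patch, [["red"], ["nir"]])
```

In this toy case the per-band targets for `red` and `nir` become nearly identical, so the relative brightness of the two bands is lost, whereas the jointly normalized targets keep `nir` above `red`. Under the stated assumption, that preserved inter-band relationship is the kind of spectral information a reconstruction loss could exploit, and the number of target values (hence of tokens) is unchanged.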
