How Effective is Pre-training of Large Masked Autoencoders for Downstream Earth Observation Tasks?

Jose Sosa, Mohamed Aloulou, Danila Rukhovich, Rim Sleimi, Boonyarit Changaival, Anis Kacem, Djamila Aouada

TL;DR

This work examines whether large ViT-based Masked Autoencoders pre-trained on EO data provide consistent downstream benefits. It compares Setting 1 (encoder initialised with pre-trained weights) against Setting 2 (the same encoder trained from scratch) across reconstruction, segmentation, and classification tasks, using the Prithvi and SatMAE encoders. Pre-training yields gains mainly when the downstream task resembles the pre-training objective (e.g., cloud gap imputation); for segmentation and classification, training from scratch with tuned hyperparameters is often equally or more effective. The results imply that the high compute cost of pre-training may not be justified for EO, and that careful hyperparameter optimization of scratch-trained models can achieve similar or better performance. The authors argue that the limitations may stem from model design rather than data, and suggest that future work explore architectures and training strategies across more EO datasets and tasks. The work informs practitioners about the trade-offs between foundation-model pre-training and task-specific training in Earth Observation contexts.

Abstract

Self-supervised pre-training has proven highly effective for many computer vision tasks, particularly when labelled data are scarce. In the context of Earth Observation (EO), foundation models and various other Vision Transformer (ViT)-based approaches have been successfully applied for transfer learning to downstream tasks. However, it remains unclear under which conditions pre-trained models offer significant advantages over training from scratch. In this study, we investigate the effectiveness of pre-training ViT-based Masked Autoencoders (MAE) for downstream EO tasks, focusing on reconstruction, segmentation, and classification. We consider two large ViT-based MAE pre-trained models: a foundation model (Prithvi) and SatMAE. We evaluate Prithvi on reconstruction and segmentation-based downstream tasks, and for SatMAE we assess its performance on a classification downstream task. Our findings suggest that pre-training is particularly beneficial when the fine-tuning task closely resembles the pre-training task, e.g. reconstruction. In contrast, for tasks such as segmentation or classification, training from scratch with specific hyperparameter adjustments proved to be equally or more effective.

Paper Structure (11 sections, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Settings for evaluation of ViT-based models. In Setting 1, the encoder $E$ coupled with $M_i$ is initialised with pre-trained weights. Setting 2 uses the same architecture, but $E$ is not initialised with pre-trained weights. Task-related metrics are compared across both settings to assess the effect of the self-supervised pre-training stage.
  • Figure 2: Training MAE for cloud imputation. Unlike the standard MAE, which relies on a fixed masking ratio, in this case we provide binary cloud masks for the inputs to train $E$ and $D$ from scratch.
  • Figure 3: Standard ensemble for segmentation tasks. For crop classification, $E$ is coupled with a convolutional head, which may contain convolutional and linear layers. The dataset provides labels for supervised training of the model.
  • Figure 4: Comparison of model performance under different initialisations. We calculate the mIoU for each of the crop and land cover classes on the test set. Averages for each setting appear in the legend at the top of the plot.
  • Figure 5: Different initialisations for crop segmentation with simulated cloudy data. mIoU at each masking ratio for each initialisation of the ViT encoder.
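
Figures 4 and 5 report IoU per class and its mean (mIoU). As a reference for how such a metric is commonly computed, here is a minimal NumPy sketch. It is not the authors' evaluation code: the function names `per_class_iou` and `mean_iou`, the confusion-matrix approach, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration.

```python
import numpy as np


def per_class_iou(pred, target, num_classes):
    """IoU for each class; NaN where a class appears in neither array.

    Assumption: `pred` and `target` hold integer class ids in [0, num_classes).
    """
    pred = np.asarray(pred).astype(np.int64).ravel()
    target = np.asarray(target).astype(np.int64).ravel()

    # Confusion matrix: rows are true classes, columns are predicted classes.
    cm = np.bincount(
        target * num_classes + pred, minlength=num_classes**2
    ).reshape(num_classes, num_classes)

    intersection = np.diag(cm).astype(float)
    # |A ∪ B| = |A| + |B| - |A ∩ B|
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection

    iou = np.full(num_classes, np.nan)
    valid = union > 0
    iou[valid] = intersection[valid] / union[valid]
    return iou


def mean_iou(pred, target, num_classes):
    """Mean of the per-class IoU values, ignoring absent classes (NaN)."""
    return float(np.nanmean(per_class_iou(pred, target, num_classes)))


if __name__ == "__main__":
    # Toy labels for three hypothetical classes.
    target = [0, 0, 1, 1, 2, 2]
    pred = [0, 1, 1, 1, 2, 0]
    print(per_class_iou(pred, target, 3))
    print(mean_iou(pred, target, 3))
```

Running the same function on the predictions from Setting 1 and Setting 2 against the same test labels gives the kind of per-class comparison the figure describes. Whether absent classes are skipped or counted as zero changes the average, so the convention should be stated alongside any reported number.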