Uncertainty and Generalizability in Foundation Models for Earth Observation

Raul Ramos-Pollan, Freddie Kalaitzis, Karthick Panner Selvam

TL;DR

This work investigates uncertainty and spatial generalizability of Earth Observation foundation models under limited labeling budgets. It conducts a large-scale ablation using eight FM embeddings from Sentinel-1 and Sentinel-2 inputs to predict seven ESA World Cover classes via chip-level linear probes across eleven AOIs, including training on external AOIs and on target AOIs with various sampling strategies. The study reveals substantial variability in generalizability and uncertainty across AOIs, tasks, and FMs, with some cases achieving correlations above $0.9$ while others remain limited, and demonstrates the impact of input modality and sampling on performance. It advocates for worldwide, reference-label-based evaluation with simple probes to guide FM selection and downstream task design under labeling constraints, and presents practical guidelines for reducing labeling budgets while maintaining reliable performance.

Abstract

We take the perspective of a practitioner who wants to design a downstream task (such as estimating vegetation coverage) on a certain area of interest (AOI) with a limited labeling budget. When leveraging an existing Foundation Model (FM), we must decide whether to train a downstream model on a different but label-rich AOI, hoping it generalizes to our AOI, or to split the labels in our AOI for training and validation. In either case, we face choices concerning which FM to use, how to sample our AOI for labeling, etc., which affect both the performance and the uncertainty of the results. In this work, we perform a large ablative study using eight existing FMs with either Sentinel-1 or Sentinel-2 as input data, and the classes from the ESA World Cover product as downstream tasks, across eleven AOIs. We do repeated sampling and training, resulting in an ablation of some 500K simple linear regression models. Our results show both the limits of spatial generalizability across AOIs and the power of FMs, with which we obtain a correlation coefficient above 0.9 between predictions and targets on different chip-level predictive tasks. Still, performance and uncertainty vary greatly across AOIs, tasks and FMs. We believe this is a key issue in practice, because there are many design decisions behind each FM and downstream task (input modalities, sampling, architectures, pretraining, etc.), and a downstream task designer is usually aware of, and can decide upon, only a few of them. Through this work, we advocate for the use of the methodology described herein (large ablations on reference global labels and simple probes), both when publishing new FMs and when making informed decisions about designing downstream tasks that use them.
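The repeated sampling and linear probing described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the ordinary least-squares fit, the synthetic embeddings and the sample sizes are all assumptions made for illustration. The 20 runs and the thresholds of 0.7 (mean correlation) and 0.05 (standard deviation) follow the figure captions listed further down this page.

```python
import numpy as np


def probe_correlation(train_x, train_y, test_x, test_y):
    """Fit a least-squares linear probe and return the Pearson correlation
    between its predictions and the test targets."""
    # Append a constant column so the probe has an intercept term.
    a_train = np.hstack([train_x, np.ones((len(train_x), 1))])
    a_test = np.hstack([test_x, np.ones((len(test_x), 1))])
    weights, *_ = np.linalg.lstsq(a_train, train_y, rcond=None)
    preds = a_test @ weights
    return float(np.corrcoef(preds, test_y)[0, 1])


def repeated_probe(emb_train, y_train, emb_target, y_target,
                   n_train, n_test, n_runs=20, seed=0):
    """Resample train and test chips n_runs times and return the mean and
    standard deviation of the correlation coefficient across runs."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_runs):
        tr = rng.choice(len(emb_train), size=n_train, replace=False)
        te = rng.choice(len(emb_target), size=n_test, replace=False)
        scores.append(probe_correlation(emb_train[tr], y_train[tr],
                                        emb_target[te], y_target[te]))
    scores = np.asarray(scores)
    return float(scores.mean()), float(scores.std())


# Synthetic stand-ins (assumption): 16-dimensional "embeddings" whose target
# is a noisy linear function shared by the external and the target AOI.
data_rng = np.random.default_rng(42)
true_w = data_rng.normal(size=16)
emb_external = data_rng.normal(size=(2000, 16))
y_external = emb_external @ true_w + 0.1 * data_rng.normal(size=2000)
emb_target = data_rng.normal(size=(600, 16))
y_target = emb_target @ true_w + 0.1 * data_rng.normal(size=600)

mean_r, std_r = repeated_probe(emb_external, y_external,
                               emb_target, y_target,
                               n_train=1000, n_test=500)

# Thresholds taken from the figure captions: mean correlation above 0.7
# and standard deviation below 0.05.
is_reliable = mean_r > 0.7 and std_r < 0.05
print(f"mean r = {mean_r:.3f}, std = {std_r:.4f}, reliable = {is_reliable}")
```

With real data, `emb_external` and `emb_target` would hold FM embeddings of chips from the training and target AOIs, and the targets would be the per-chip percentage of one ESA World Cover class. The spread across runs is what exposes the uncertainty that a single train/test split would hide.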

Paper Structure (18 sections, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Distributions of the seven selected esawc classes for this work on each AOI. Percentage numbers indicate how much of the AOI landmass is covered by those seven classes. The dotted line separates external AOIs from target AOIs.
  • Figure 2: Example predictions of a linear probe on the embedding from FM s1-fdl2024-mae, trained with data from Europe on different esawc tasks and target AOIs. An asterisk [*] denotes a correlation coefficient greater than 0.7, the threshold above which we consider the embeddings to contain useful information for that task and target AOI.
  • Figure 3: Overall view of linear probes with different train AOIs, target AOIs and downstream tasks, showing models trained on 30K elements in external AOIs and tested with 500 elements in target AOIs. Values are the mean of 20 runs. The correlation threshold is set at 0.7 (white). Bluer positions represent greater correlation between predictions and targets, redder ones worse. The black horizontal line splits S1 and S2 FMs.
  • Figure 4: Selected ablations increasing the number of test chips used in the target AOI, represented by dot size over the set of values [10,50,100,500]. Square markers represent models with Sentinel-2 input, round ones models with Sentinel-1 input. Thresholds of 0.7 for the mean correlation coefficient and 0.05 for its standard deviation are shown.
  • Figure 5: Predictions for one run of the cases in Table \ref{tab:resultsexternal}. Color shows the percentage of the esawc class, either the target or the prediction.
  • ...and 2 more figures