SHRUG-FM: Reliability-Aware Foundation Models for Earth Observation

Kai-Hendrik Cohrs, Zuzanna Osika, Maria Gonzalez-Calabuig, Vishal Nedungadi, Ruben Cartuyvels, Steffen Knoblauch, Joppe Massant, Shruti Nath, Patrick Ebel, Vasileios Sitokonstantinou

TL;DR

SHRUG-FM addresses the reliability gap of geospatial foundation models (GFMs) under distribution shift by integrating three complementary signals: input-space OOD detection, embedding-space OOD detection, and task-specific predictive uncertainty. Using burn scar segmentation with SSL4EO-S12 encodings and HydroATLAS context, it demonstrates that OOD scores correlate with degraded performance and that uncertainty-based flags can discard many unreliable predictions. The approach pairs frozen foundation encoders with a downstream decoder and uses ensembles (Deep Ensembles and MC Dropout) to quantify uncertainty at the pixel and image levels, with metric definitions for calibration and reliability. By linking failures to geospatial attributes and providing a dashboard for interpretable reliability assessment, SHRUG-FM offers a practical pathway toward safer deployment of GFMs in climate-sensitive applications and informs pretraining data strategies that could reduce future gaps.

Abstract

Geospatial foundation models (GFMs) for Earth observation often fail to perform reliably in environments underrepresented during pretraining. We introduce SHRUG-FM, a framework for reliability-aware prediction that integrates three complementary signals: out-of-distribution (OOD) detection in the input space, OOD detection in the embedding space, and task-specific predictive uncertainty. Applied to burn scar segmentation, SHRUG-FM shows that OOD scores correlate with lower performance in specific environmental conditions, while uncertainty-based flags help discard many poorly performing predictions. Linking these flags to land cover attributes from HydroATLAS shows that failures are not random but concentrated in certain geographies, such as low-elevation zones and large river areas, likely due to underrepresentation in pretraining data. SHRUG-FM provides a pathway toward safer and more interpretable deployment of GFMs in climate-sensitive applications, helping bridge the gap between benchmark performance and real-world reliability.
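
As a rough illustration of the uncertainty signal, the sketch below computes the per-pixel mean and variance across ensemble members (or MC Dropout passes), averages the variance into an image-level score, and flags images whose score exceeds a threshold. The function names, the mean aggregation, and the threshold are assumptions made for this example and are not taken from the paper.

```python
from statistics import mean, pvariance


def pixel_mean_and_variance(member_probs):
    """Per-pixel mean and variance over ensemble members.

    member_probs: list of M flattened probability maps, one per ensemble
    member or MC Dropout pass (each a list of floats of equal length).
    """
    per_pixel = list(zip(*member_probs))  # regroup values by pixel
    mean_map = [mean(p) for p in per_pixel]
    var_map = [pvariance(p) for p in per_pixel]
    return mean_map, var_map


def image_uncertainty(var_map):
    """Image-level score: mean pixel variance (an assumed aggregation)."""
    return mean(var_map)


def flag_unreliable(scores, threshold):
    """Flag images whose uncertainty score exceeds a hypothetical threshold."""
    return [s > threshold for s in scores]


# Two toy "images" of two pixels each, predicted by two ensemble members.
agreeing = [[0.9, 0.1], [0.9, 0.1]]
disagreeing = [[0.9, 0.1], [0.1, 0.9]]

scores = [
    image_uncertainty(pixel_mean_and_variance(members)[1])
    for members in (agreeing, disagreeing)
]
flags = flag_unreliable(scores, threshold=0.05)  # threshold is illustrative
```

In practice the threshold would presumably be tuned on held-out data, trading off how many predictions are discarded against the quality of those that are kept.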

Paper Structure

This paper contains 34 sections, 10 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The SHRUG-FM framework. It computes three complementary signals: OOD detection in the input, OOD detection in the embeddings and task-specific predictive uncertainty. These are combined to enable reliability-aware predictions, flagging or abstaining from low-confidence outputs.
  • Figure 2: SHRUG-FM combines complementary signals to flag unreliable predictions. (a) The out-of-distribution (OOD) metric NCDD (in the embedding space) correlates with F1 scores (samples are grouped by HydroATLAS attributes). Low elevation, low pasture extent and large river areas are associated with lower performance and stronger OOD signals (higher NCDD). (b) A histogram of per-image test performance shows that the variance-based flag discards lower-performing samples. (c) Metrics are integrated into a dashboard that visualizes predictions, probability maps and reliability scores.
  • Figure 3: Density of distances to the nearest k-means centroid for pretraining and downstream data. Downstream samples fall in the higher-distance, low-density regions of the pretraining distribution.
  • Figure 4: Visualization of NCDD values for the downstream task overlaid on a hexagonal spatial distribution of SSL4EO-S12 across the US. Downstream data points are shown as scatter points positioned by their geographic locations. The color scale from purple to yellow represents increasing NCDD values in the raw image space, while the opacity of each point corresponds to NCDD values in the image embedding space. For example, an opaque point corresponds to a low NCDD value in the embedding space, and a yellow point corresponds to a high NCDD value in the image space. This highlights potential concerns regarding the sampling strategy employed during pretraining.
  • Figure 5: Visualization of F1 scores for the downstream task, Elevation (avg), River Area and Pasture Extent overlaid on a hexagonal spatial distribution of SSL4EO-S12 across the US. Cross (+) points mark the highest and lowest values of each set. The F1 map highlights a low-performing cluster in the southeastern US, corresponding to a region characterized by low elevation, limited pastures and a relatively large river area. Moreover, the area contains few SSL4EO-S12 pretraining data points, hinting that the foundation model could benefit from increased data representation for this region.
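
The distance-to-nearest-centroid idea behind Figure 3 can be sketched as follows: fit k-means on pretraining embeddings, then score any sample by its Euclidean distance to the closest centroid. This is a minimal sketch under stated assumptions. The plain Lloyd's k-means, the toy 2-D points, and all function names are illustrative, and the score is not claimed to be the paper's NCDD metric, whose exact definition is not given in this excerpt.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of equal-length tuples (illustrative)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its closest current centroid.
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        # Recompute centroids; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return centroids


def nearest_centroid_distance(x, centroids):
    """Distance from x to the closest centroid; larger suggests more OOD."""
    return min(math.dist(x, c) for c in centroids)


# Toy stand-ins for pretraining and downstream embeddings.
pretrain = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
downstream = [(100.0, 100.0)]

centroids = kmeans(pretrain, k=2)
pretrain_d = [nearest_centroid_distance(p, centroids) for p in pretrain]
downstream_d = [nearest_centroid_distance(p, centroids) for p in downstream]
```

Comparing the two lists of distances, for instance as histograms or density estimates, gives the kind of comparison Figure 3 describes: downstream samples that sit far from every pretraining centroid land in the tail of the pretraining distance distribution.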