Distribution Shifts at Scale: Out-of-distribution Detection in Earth Observation

Burak Ekim, Girmaw Abebe Tadesse, Caleb Robinson, Gilles Hacheme, Michael Schmitt, Rahul Dodhia, Juan M. Lavista Ferres

TL;DR

The paper tackles the challenge of distribution shifts in Earth Observation by proposing TARDIS, a post-hoc OOD detector that preserves in-distribution performance while operating without labeled OOD data. It generates surrogate ID/OOD labels for unseen data by clustering internal activations of a pre-trained model and training a lightweight binary classifier on these features. Across EuroSAT and xBD, TARDIS achieves near-upper-bound surrogate-labeling performance in most setups and matches top post-hoc methods, with strong scalability demonstrated in the Fields of the World deployment. This approach enables global, real-time diagnostics of model robustness in low-data regions, offering practical, interpretable insights into distribution shifts at scale.

Abstract

Training robust deep learning models is crucial in Earth Observation, where globally deployed models often face distribution shifts that degrade performance, especially in low-data regions. Out-of-distribution (OOD) detection addresses this by identifying inputs that deviate from in-distribution (ID) data. However, existing methods either assume access to OOD data or compromise primary task performance, limiting real-world use. We introduce TARDIS, a post-hoc OOD detection method designed for scalable geospatial deployment. Our core innovation lies in generating surrogate distribution labels by leveraging ID data within the feature space. TARDIS takes a pre-trained model, ID data, and data from an unknown distribution (WILD), separates WILD into surrogate ID and OOD labels based on internal activations, and trains a binary classifier to detect distribution shifts. We validate on EuroSAT and xBD across 17 setups covering covariate and semantic shifts, showing near-upper-bound surrogate labeling performance in 13 cases and matching the performance of top post-hoc activation- and scoring-based methods. Finally, deploying TARDIS on Fields of the World reveals actionable insights into pre-trained model behavior at scale. The code is available at https://github.com/microsoft/geospatial-ood-detection.

Paper Structure

This paper contains 21 sections, 1 equation, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Overview of the proposed OOD detection method. Given a pre-trained model, ID samples, and WILD samples (from unknown distributions), TARDIS assigns surrogate ID/OOD labels to WILD samples using the ID set and fits a binary classifier $g$ (top row). During deployment, $g$ uses internal activations of unseen samples to predict whether they are ID or OOD (bottom row).
  • Figure 2: The proposed framework consists of four key steps: (1) Sampling in-distribution (ID) and WILD samples; (2) Extracting internal activations from a pre-trained model $f$ for both ID and WILD samples; (3) Clustering the combined feature space and labeling WILD samples as surrogate-ID or surrogate-OOD; (4) Fitting a binary classifier $g$ on the labeled feature representations to distinguish between ID and OOD samples. The classifier $g$, during deployment, flags out-of-distribution inputs.
  • Figure 3: Geographical distribution of ID and WILD sets, containing 500 and 1200 samples, respectively. The ID set is sampled from the FTW dataset training set, while the WILD set is randomly sampled from the Microsoft Planetary Computer. Each Sentinel-2 patch is provided in two different time frames (planting and harvesting). The model $f$ takes both Sentinel-2 images from different seasons as input, and the predictions are shown on the right. The OOD probability values are thresholded at 0.5.
  • Figure 4: Correlation between model performance and OOD score skewness. Low performance of the model $f$ on the FTW test set aligns with low skewness in the OOD classifier $g$’s score distribution, suggesting the presence of OOD samples in the test set.
  • Figure 5: Examples from the xBD dataset, illustrating pre- and post-disaster images. These samples demonstrate the temporal and semantic differences between pre- and post-disaster scenes, highlighting the challenges posed by distribution shifts.
  • ...and 7 more figures
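
The four steps in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it is a toy version using only the Python standard library, under several assumptions. The "internal activations" are replaced by synthetic 2-D points, the clustering is a small hand-written k-means, a cluster is treated as surrogate-ID when its share of ID samples reaches a hypothetical `id_fraction_threshold`, and $g$ is a plain logistic regression fitted on the ID features together with the surrogate-labeled WILD features. All function names and parameter values are illustrative.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Tiny Lloyd's k-means on a list of tuples; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        assign = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: move each non-empty centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assign


def surrogate_labels(id_feats, wild_feats, k=2, id_fraction_threshold=0.3, seed=0):
    """Step 3 (assumed rule): cluster ID + WILD features together, then label each
    WILD sample 0 (surrogate-ID) or 1 (surrogate-OOD) from its cluster's ID share."""
    assign = kmeans(id_feats + wild_feats, k, seed=seed)
    n_id = len(id_feats)
    id_frac = {}
    for c in set(assign):
        idx = [i for i, a in enumerate(assign) if a == c]
        id_frac[c] = sum(i < n_id for i in idx) / len(idx)
    return [0 if id_frac[a] >= id_fraction_threshold else 1 for a in assign[n_id:]]


def fit_g(feats, labels, lr=0.1, epochs=500):
    """Step 4 (stand-in classifier): logistic regression trained by batch gradient
    descent; returns a function mapping a feature tuple to an OOD probability."""
    w, b, n = [0.0] * len(feats[0]), 0.0, len(feats)

    def prob(x):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        # Numerically stable sigmoid.
        return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))

    for _ in range(epochs):
        errs = [prob(x) - t for x, t in zip(feats, labels)]
        for j in range(len(w)):
            w[j] -= lr * sum(e * x[j] for e, x in zip(errs, feats)) / n
        b -= lr * sum(errs) / n
    return prob


def make_blob(rng, center, n, sigma=0.5):
    """Synthetic stand-in for activations extracted from a pre-trained model f."""
    return [tuple(rng.gauss(c, sigma) for c in center) for _ in range(n)]


# Steps 1-2 (synthetic): ID features near the origin; WILD is half ID-like, half shifted.
rng = random.Random(0)
id_feats = make_blob(rng, (0.0, 0.0), 100)
wild_feats = make_blob(rng, (0.0, 0.0), 50) + make_blob(rng, (5.0, 5.0), 50)

wild_labels = surrogate_labels(id_feats, wild_feats)
g = fit_g(id_feats + wild_feats, [0] * len(id_feats) + wild_labels)

# Deployment: threshold g's OOD probability at 0.5, as in the Figure 3 caption.
print("shifted sample flagged OOD:", g((5.0, 5.0)) > 0.5)
print("ID-like sample flagged OOD:", g((0.0, 0.0)) > 0.5)
```

In this toy setup no OOD labels are ever supplied: the only supervision is which samples came from the ID set, which mirrors the post-hoc, label-free setting the abstract describes. The cluster count, threshold, and classifier choice here are placeholders and would need tuning on real activations.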