PANGAEA: A Global and Inclusive Benchmark for Geospatial Foundation Models

Valerio Marsocci, Yuru Jia, Georges Le Bellier, David Kerekes, Liang Zeng, Sebastian Hafner, Sebastian Gerard, Eric Brune, Ritu Yadav, Ali Shibli, Heng Fang, Yifang Ban, Maarten Vergauwen, Nicolas Audebert, Andrea Nascetti

TL;DR

PANGAEA introduces a global, inclusive benchmark for geospatial foundation models (GFMs), addressing fragmented evaluation by curating diverse datasets, modalities, and temporalities. The protocol freezes encoders, trains decoders, and uses band adaptation to compare GFMs against supervised baselines. It reveals that GFMs do not consistently outperform traditional methods and that performance strongly depends on pretraining data, resolution, and temporal handling. Key findings show that spectral richness and high-resolution pretraining aid dense Earth observation (EO) tasks, while temporal aggregation and cross-region transfer remain challenging. The open-source framework aims to standardize reproducible evaluation and guide future development of generalizable, task-focused geospatial models.

Abstract

Geospatial Foundation Models (GFMs) have emerged as powerful tools for extracting representations from Earth observation data, but their evaluation remains inconsistent and narrow. Existing works often evaluate on suboptimal downstream datasets and tasks that are too easy or too narrow, which limits how well these evaluations reflect the real-world applicability of GFMs. Current evaluation protocols also lack diversity: they fail to account for the multiplicity of image resolutions, sensor types, and temporalities, which further complicates the assessment of GFM performance. In particular, most existing benchmarks are geographically biased towards North America and Europe, which calls the global applicability of GFMs into question. To overcome these challenges, we introduce PANGAEA, a standardized evaluation protocol that covers a diverse set of datasets, tasks, resolutions, sensor modalities, and temporalities. It establishes a robust and widely applicable benchmark for GFMs. We evaluate the most popular openly available GFMs on this benchmark and analyze their performance across several domains. In particular, we compare these models to supervised baselines (e.g. UNet and vanilla ViT) and assess their effectiveness when faced with limited labeled data. Our findings highlight the limitations of GFMs under different scenarios, showing that they do not consistently outperform supervised models. PANGAEA is designed to be highly extensible, allowing for the seamless inclusion of new datasets, models, and tasks in future research. By releasing the evaluation code and benchmark, we aim to enable other researchers to replicate our experiments and build upon our work, fostering a more principled evaluation protocol for large pre-trained geospatial models. The code is available at https://github.com/VMarsocci/pangaea-bench.

Paper Structure

This paper contains 55 sections, 19 figures, 32 tables.

Figures (19)

  • Figure 1: Normalized performance comparison of different models across various datasets and training conditions. The y-axis represents the normalized performance across the 11 PANGAEA datasets, where the best-performing model for each dataset is assigned a value of 1 and the worst-performing model is assigned a value of 0. (a) Full Labels: Models trained on downstream tasks with access to the full labeled dataset. We observe that supervised baselines -- especially UNet -- outperform most of the models; (b) Limited Labels (10%): Models trained on downstream tasks with only 10% of labeled data. In this scenario, some GFMs -- e.g. CROMA -- excel and outperform both other GFMs and the supervised baselines, although the latter remain competitive; (c) Multi-Spectral Data: Results for datasets made exclusively of multi-spectral images, i.e. HLS Burn Scars, MADOS, PASTIS-R, Sen1Floods11, Crop Type Mapping-SS, and BioMassters. Multi-spectral datasets tend to use low-resolution images, on which UNet excels. GFMs pre-trained on high-resolution images, e.g. Scale-MAE, underperform in this setting, which requires both spatial and spectral information; (d) High-Resolution Data: Results for high-resolution datasets, i.e. xView2, FiveBillionPixels, DynamicEarthNet, and SpaceNet 7, which consist of either RGB or RGB-NIR imagery. In this setting, most GFMs pre-trained on lower-resolution imagery fall behind, except CROMA and DOFA. On the other hand, GFMs pre-trained on high-resolution images and UNet perform well.
  • Figure 2: PANGAEA aims for robust evaluation across diverse downstream datasets and applications.
  • Figure 3: Geographical distribution of PANGAEA benchmark dataset across different domains.
  • Figure 4: Fine-tuning strategies of the foundation models depending on the input temporality.
  • Figure 5: Convergence of training losses for four GFMs -- one pre-trained on high-resolution (Scale-MAE), one on text-image (RemoteCLIP), one on different modalities and resolution (DOFA) and one on low-resolution Sentinel data -- and a baseline (UNet), on different datasets (i.e. different domains and resolutions).
  • ...and 14 more figures
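
The per-dataset normalization described in the Figure 1 caption (best model mapped to 1, worst mapped to 0) can be read as a min-max rescaling. The sketch below is only an illustration of that reading, not the authors' plotting code. The function names, the `lower_is_better` argument, the choice to average normalized scores across datasets, and all numbers in `example` are assumptions introduced here.

```python
from typing import Dict, Iterable


def normalize_scores(scores: Dict[str, float],
                     higher_is_better: bool = True) -> Dict[str, float]:
    """Min-max rescale one dataset's scores so the best model gets 1.0
    and the worst gets 0.0 (hypothetical helper, not from the paper)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All models tie; treating a tie as "best" is an arbitrary choice.
        return {model: 1.0 for model in scores}
    normed = {}
    for model, value in scores.items():
        x = (value - lo) / (hi - lo)
        # For error metrics (lower is better) flip the scale.
        normed[model] = x if higher_is_better else 1.0 - x
    return normed


def mean_normalized(per_dataset: Dict[str, Dict[str, float]],
                    lower_is_better: Iterable[str] = ()) -> Dict[str, float]:
    """Average each model's normalized score over all datasets.
    Averaging is an assumption; the caption does not state the aggregation."""
    flipped = set(lower_is_better)
    collected: Dict[str, list] = {}
    for dataset, scores in per_dataset.items():
        normed = normalize_scores(scores, higher_is_better=dataset not in flipped)
        for model, value in normed.items():
            collected.setdefault(model, []).append(value)
    return {model: sum(vals) / len(vals) for model, vals in collected.items()}


# Made-up scores for three placeholder models on two placeholder datasets.
example = {
    "dataset_x": {"model_a": 60.0, "model_b": 80.0, "model_c": 70.0},  # accuracy-like
    "dataset_y": {"model_a": 10.0, "model_b": 30.0, "model_c": 15.0},  # error-like
}
summary = mean_normalized(example, lower_is_better=["dataset_y"])
print(summary)
```

Normalizing per dataset before aggregating keeps datasets with different metric ranges from dominating the comparison, at the cost of hiding absolute performance gaps.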