PhilEO Bench: Evaluating Geo-Spatial Foundation Models

Casper Fibaek; Luke Camilleri; Andreas Luyts; Nikolaos Dionelis; Bertrand Le Saux

TL;DR

This work tackles the challenge of labeled data scarcity in Earth Observation by proposing PhilEO Bench, a unified, fair evaluation framework for geo-spatial foundation models. It introduces a global 400 GB Sentinel-2 dataset with three downstream tasks (building density, road segmentation, land cover) and standardizes evaluation through a fixed test set and a common decoder, enabling valid cross-model comparisons across n-shot transfers and fine-tuning vs linear probing. Experiments show that state-of-the-art FM approaches can be outperformed by conventional architectures like U-Net on image-to-image tasks, underscoring the importance of evaluation design and decoder choice. The framework and dataset aim to drive reproducible progress in EO foundation models and facilitate practical deployment.

Abstract

Massive amounts of unlabelled data are captured by Earth Observation (EO) satellites, with the Sentinel-2 constellation generating 1.6 TB of data daily. This makes Remote Sensing a data-rich domain well suited to Machine Learning (ML) solutions. However, a bottleneck in applying ML models to EO is the lack of annotated data, as annotation is a labour-intensive and costly process. As a result, research in this domain has focused on Self-Supervised Learning and Foundation Model approaches. This paper addresses the need to evaluate different Foundation Models on a fair and uniform benchmark by introducing the PhilEO Bench, a novel evaluation framework for EO Foundation Models. The framework comprises a testbed and a novel 400 GB Sentinel-2 dataset containing labels for three downstream tasks: building density estimation, road segmentation, and land cover classification. We present experiments using our framework evaluating different Foundation Models, including Prithvi and SatMAE, at multiple n-shots and convergence rates.

Paper Structure (11 sections, 5 figures, 1 table)

Introduction
Related work
The PhilEO Evaluation Framework
PhilEO Downstream Dataset
Evaluation framework
Evaluating Foundation Models
Experiment for the three downstream tasks
Implementation details for the experiment
Results and discussion
Conclusion
Acknowledgement

Figures (5)

Figure 1: The PhilEO Suite, where complex neural networks are trained on massive data. On the right, the PhilEO Bench evaluates such foundation models on diverse downstream tasks.
Figure 2: From left: (1) RGB visualisation of a Sentinel-2 (S2) patch, (2) Land cover labels, (3) Building Density labels, and (4) Road Density labels.
Figure 3: Land cover classification accuracy (acc) using PhilEO Bench for $n$ samples per region with different architectures and transfer learning paradigms: (1) Linear probing (lp), and (2) Fine-tuning (ft). The legend applies to both this plot and Fig. 4.
Figure 4: MSE for building density regression using the PhilEO Bench for $n$ samples per region, evaluating different models.
Figure 5: From left: (1) RGB visualisation of Sentinel-2 patch, (2) Building Density labels, and (3) Model predictions. Top row is a U-Net model, while bottom row is the Prithvi model.
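
Figures 3 and 4 report results for $n$ training samples per region. This page does not describe how those subsets are drawn, so the following is only a minimal sketch of one plausible way to build a reproducible n-shot training subset. The function name `sample_n_per_region`, the `(region, patch_id)` pair layout, and the toy region names are hypothetical and are not taken from the PhilEO Bench code.

```python
import random
from collections import defaultdict


def sample_n_per_region(samples, n, seed=0):
    """Return at most n samples from every region, chosen reproducibly.

    `samples` is an iterable of (region, patch_id) pairs. This layout is an
    assumption made for illustration, not the PhilEO Bench data format.
    """
    # Group patch identifiers by the region they belong to.
    by_region = defaultdict(list)
    for region, patch_id in samples:
        by_region[region].append(patch_id)

    # A seeded generator and sorted inputs make the subset independent of
    # the order in which the samples were listed.
    rng = random.Random(seed)
    subset = []
    for region in sorted(by_region):
        patches = sorted(by_region[region])
        k = min(n, len(patches))  # a region may hold fewer than n patches
        for patch_id in rng.sample(patches, k):
            subset.append((region, patch_id))
    return subset


# Toy data: three made-up regions with ten patches each.
toy = [
    (region, f"{region}_{i:03d}")
    for region in ("north", "south", "east")
    for i in range(10)
]

for n in (1, 5, 10):
    print(n, len(sample_n_per_region(toy, n)))
```

Fixing the seed keeps the training subset identical across models for a given $n$, which matches the benchmark's stated goal of fair comparison. The fixed test set mentioned in the TL;DR would be kept separate from any such sampling.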
