PhilEO Bench: Evaluating Geo-Spatial Foundation Models
Casper Fibaek, Luke Camilleri, Andreas Luyts, Nikolaos Dionelis, Bertrand Le Saux
TL;DR
This work tackles the challenge of labeled data scarcity in Earth Observation (EO) by proposing PhilEO Bench, a unified, fair evaluation framework for geo-spatial foundation models. It introduces a global 400 GB Sentinel-2 dataset with three downstream tasks (building density estimation, road segmentation, land cover classification) and standardizes evaluation through a fixed test set and a common decoder, enabling valid cross-model comparisons across n-shot transfer settings and between fine-tuning and linear probing. Experiments show that state-of-the-art foundation model approaches can be outperformed by conventional architectures like U-Net on image-to-image tasks, underscoring the importance of evaluation design and decoder choice. The framework and dataset aim to drive reproducible progress in EO foundation models and facilitate practical deployment.
Abstract
Massive amounts of unlabelled data are captured by Earth Observation (EO) satellites, with the Sentinel-2 constellation generating 1.6 TB of data daily. This makes Remote Sensing a data-rich domain well suited to Machine Learning (ML) solutions. However, a bottleneck in applying ML models to EO is the lack of annotated data, as annotation is a labour-intensive and costly process. As a result, research in this domain has focused on Self-Supervised Learning and Foundation Model approaches. This paper addresses the need to evaluate different Foundation Models on a fair and uniform benchmark by introducing the PhilEO Bench, a novel evaluation framework for EO Foundation Models. The framework comprises a testbed and a novel 400 GB Sentinel-2 dataset containing labels for three downstream tasks: building density estimation, road segmentation, and land cover classification. We present experiments using our framework to evaluate different Foundation Models, including Prithvi and SatMAE, at multiple n-shots and convergence rates.
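
As an illustration of what evaluating "at multiple n-shots" on a fixed test set can look like, the following is a minimal sketch rather than the PhilEO Bench implementation. The function names, the nested-subset sampling, and the `fit_and_score` callback are assumptions introduced here for illustration only. The idea is that every model is trained on the same n-sample subsets and scored on the same held-out test set, so differences in scores reflect the models and not the data split.

```python
import random


def make_n_shot_splits(train_ids, shots, seed=0):
    """Build one training subset per n-shot size (hypothetical helper).

    The ids are shuffled once with a fixed seed and each subset is a
    prefix of that shuffle, so smaller subsets are contained in larger ones.
    Sizes larger than the training pool are skipped.
    """
    rng = random.Random(seed)
    shuffled = list(train_ids)
    rng.shuffle(shuffled)
    return {n: shuffled[:n] for n in shots if n <= len(shuffled)}


def evaluate_models(models, train_ids, test_ids, shots, fit_and_score, seed=0):
    """Score every model at every n-shot size on the same fixed test set.

    `fit_and_score(model, train_subset, test_ids)` is a user-supplied
    callback (an assumption of this sketch) that trains the model on the
    subset and returns a metric computed on the test ids.
    """
    splits = make_n_shot_splits(train_ids, shots, seed=seed)
    fixed_test = tuple(test_ids)  # identical for all models and all n
    results = {}
    for name, model in models.items():
        for n, subset in splits.items():
            results[(name, n)] = fit_and_score(model, subset, fixed_test)
    return results
```

A call such as `evaluate_models(models, train_ids, test_ids, shots=[50, 100, 500], fit_and_score=my_callback)` would return a dictionary keyed by `(model_name, n)`; the shot sizes shown here are placeholders, not the values used in the paper.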
