SSL4Eco: A Global Seasonal Dataset for Geospatial Foundation Models in Ecology

Elena Plekhanova, Damien Robert, Johannes Dollinger, Emilia Arens, Philipp Brun, Jan Dirk Wegner, Niklaus Zimmermann

TL;DR

SSL4Eco introduces a global, phenology-informed, multi-date Sentinel-2 dataset designed to train geospatial foundation models for ecology, with uniform land coverage and local seasonal sampling. By pairing SSL4Eco with SeCo-Eco, a seasonality-aware model, the work demonstrates that simple, phenology-driven dataset construction yields consistent improvements across diverse macroecological tasks, achieving state-of-the-art results on 7 of 8 benchmarks. The study also shows that calendar-based sampling underperforms the EVI-informed approach, underscoring the importance of local vegetation cycles in representation learning. The authors release their data, code, and pretrained weights to support ecological and computer-vision research.

Abstract

With the exacerbation of the biodiversity and climate crises, macroecological pursuits such as global biodiversity mapping are becoming more urgent. Remote sensing offers a wealth of Earth observation data for ecological studies, but the scarcity of labeled datasets remains a major challenge. Recently, self-supervised learning has enabled learning representations from unlabeled data, triggering the development of pretrained geospatial models with generalizable features. However, these models are often trained on datasets biased toward areas of high human activity, leaving entire ecological regions underrepresented. Additionally, while some datasets attempt to address seasonality through multi-date imagery, they typically follow calendar seasons rather than local phenological cycles. To better capture vegetation seasonality at a global scale, we propose a simple phenology-informed sampling strategy and introduce the corresponding dataset, SSL4Eco, a multi-date Sentinel-2 dataset on which we train an existing model with a season-contrastive objective. We compare representations learned from SSL4Eco against those learned from other datasets on diverse ecological downstream tasks and demonstrate that our straightforward sampling method consistently improves representation quality, highlighting the importance of dataset construction. The model pretrained on SSL4Eco reaches state-of-the-art performance on 7 out of 8 downstream tasks spanning (multi-label) classification and regression. We release our code, data, and model weights to support macroecological and computer vision research at https://github.com/PlekhanovaElena/ssl4eco.

Paper Structure

This paper contains 39 sections, 5 figures, 17 tables.

Figures (5)

  • Figure 1: We propose SSL4Eco, a multi-date Sentinel-2 dataset for pretraining foundation models targeted for macroecological applications. Unlike comparable datasets (a), SSL4Eco uniformly covers the entire landmass (b), thus capturing all environment types without favoring urban and agricultural areas, or ignoring entire ecoregions (c).
  • Figure 2: Unlike previous works, which sample seasonal images based on calendar dates [schmitt2019sen12ms, seco, ssl4eo] (dashed lines in (a)), we define phenology-informed, local seasons based on the Enhanced Vegetation Index [justice1998moderate, mcduserguide] (colored sections in (a)). As a result, our SSL4Eco dataset covers the full cycle of vegetation activity at each location (b), capturing patterns otherwise missed by calendar sampling.
  • Figure 3: Linear probing performance across all datasets. We compare SeCo-Eco against the respective best-performing model among our reported set of baselines.
  • Figure A-1: Enhanced Vegetation Index (EVI) curve of the vegetation cycle at a given location. Based on this curve, the Greenup, Maturity, Senescence, and Dormancy seasonality variables are defined as detailed in Tab. \ref{tab:evi_definition}. Image taken from [mcduserguide].
  • Figure A-2: Spatial distribution of the four new downstream tasks created for this work. We sample Biomes and CHELSA locations uniformly across the landmass. Meanwhile, the CAVM dataset is located in arctic regions and EU-Forest is limited to Europe.
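
The captions of Figures 2 and A-1 describe local seasons derived from an EVI curve through the Greenup, Maturity, Senescence, and Dormancy variables. The sketch below illustrates one way such seasons could be delimited for a single location and one image date picked per season. It is not the authors' pipeline: the function names `phenology_dates` and `seasonal_sample_days` are hypothetical, the 15% and 90% amplitude thresholds are assumptions (the paper's exact definitions are in its table on EVI variables, not reproduced here), and choosing the midpoint of each season is purely illustrative. It also assumes a single growth cycle per year.

```python
import numpy as np


def phenology_dates(evi, low=0.15, high=0.90):
    """Return day indices (greenup, maturity, senescence, dormancy).

    Assumes one growth cycle in `evi`. The thresholds are fractions of the
    EVI amplitude and are assumptions, not values taken from the paper.
    """
    evi = np.asarray(evi, dtype=float)
    peak = int(np.argmax(evi))
    base = evi.min()
    amplitude = evi.max() - base
    low_thr = base + low * amplitude
    high_thr = base + high * amplitude

    rising = evi[: peak + 1]   # from start of the year up to the peak
    falling = evi[peak:]       # from the peak to the end of the year

    # First crossing of each threshold on the rising limb.
    greenup = int(np.argmax(rising >= low_thr))
    maturity = int(np.argmax(rising >= high_thr))
    # Last day still above each threshold on the falling limb.
    senescence = peak + int(np.flatnonzero(falling >= high_thr)[-1])
    dormancy = peak + int(np.flatnonzero(falling >= low_thr)[-1])
    return greenup, maturity, senescence, dormancy


def seasonal_sample_days(evi):
    """Pick one day per local season (illustrative choice: the midpoint)."""
    n = len(evi)
    g, m, s, d = phenology_dates(evi)
    windows = {
        "greenup": (g, m),
        "maturity": (m, s),
        "senescence": (s, d),
        "dormancy": (d, g + n),  # wraps around into the next year
    }
    return {name: ((start + end) // 2) % n
            for name, (start, end) in windows.items()}


# Synthetic single-peak EVI curve for one year (made-up values).
days = np.arange(365)
evi_curve = 0.1 + 0.5 * np.exp(-((days - 200) / 40.0) ** 2)
print(phenology_dates(evi_curve))
print(seasonal_sample_days(evi_curve))
```

Unlike a fixed calendar split, the four sampled days shift with the location's own EVI curve, so a site whose vegetation peaks in January would be sampled around January rather than around a nominal summer date.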