
GeoPlant: Spatial Plant Species Prediction Dataset

Lukas Picek, Christophe Botella, Maximilien Servajean, César Leblanc, Rémi Palard, Théo Larcher, Benjamin Deneu, Diego Marcos, Pierre Bonnet, Alexis Joly

TL;DR

GeoPlant delivers the largest continental-scale, multimodal plant SDM dataset to date, integrating 5M Presence-Only and 90k Presence-Absence records across 38 European countries with rich environmental rasters, Sentinel-2 imagery, and Landsat time series over two decades. It provides a Kaggle-hosted benchmark and open baselines to enable robust, multimodal SDM evaluation, addressing biases in PO data through standardized PA evaluations and offering diverse predictors (climate, land cover, soil, elevation, human footprint) at high spatial resolution. The work demonstrates that multimodal ensembles leveraging satellite imagery, time-series, and climate data outperform single-modality and traditional methods, and proposes a practical approach to estimating the number of species per survey to improve multi-label predictions. GeoPlant aims to accelerate ecological modeling and biodiversity monitoring at continental scales, while acknowledging biases, imbalances, and computational demands inherent in such diverse data.
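To make the species-count idea concrete: a multi-label SDM outputs one score per species, and an estimate of how many species a survey contains can act as a per-survey cutoff in place of a fixed probability threshold. The snippet below is a minimal sketch of that idea and not the authors' implementation; the function name `select_top_k`, the species identifiers, and the scores are all hypothetical.

```python
def select_top_k(scores, k):
    """Return the k species with the highest predicted scores.

    scores: dict mapping species id -> predicted probability (hypothetical values).
    k: estimated number of species present in the survey (assumed to be given).
    """
    k = max(0, min(k, len(scores)))  # clamp k to a valid range
    # Sort by descending score; break ties by species id for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [species for species, _ in ranked[:k]]


# Hypothetical per-species probabilities for one survey.
survey_scores = {"sp_a": 0.91, "sp_b": 0.15, "sp_c": 0.62, "sp_d": 0.40}

# A fixed threshold of 0.5 would keep two species; with an estimated
# species count of 3, the three best-scoring species are kept instead.
predicted = select_top_k(survey_scores, k=3)
print(predicted)  # ['sp_a', 'sp_c', 'sp_d']
```

The practical difference from a global threshold is that the number of predicted species adapts to each survey, which matters when species richness varies strongly between sites.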

Abstract

The difficulty of monitoring biodiversity at fine scales and over large areas limits ecological knowledge and conservation efforts. To fill this gap, Species Distribution Models (SDMs) predict species across space from spatially explicit features. Yet they face the challenge of integrating the rich but heterogeneous data made available over the past decade, notably millions of opportunistic species observations and standardized surveys, as well as multimodal remote sensing data. To address this, we have designed and developed a new European-scale dataset for SDMs at high spatial resolution (10 to 50 m), including more than 10k species (i.e., most of the European flora). The dataset comprises 5M heterogeneous Presence-Only records and 90k exhaustive Presence-Absence survey records, all accompanied by diverse environmental rasters (e.g., elevation, human footprint, and soil) traditionally used in SDMs. It also provides Sentinel-2 RGB and NIR satellite images at 10 m resolution, a 20-year time series of climatic variables, and satellite time series from the Landsat program. Alongside the data, we provide an openly accessible SDM benchmark (hosted on Kaggle), which has already attracted an active community, and a set of strong baselines for single-predictor/single-modality and multimodal approaches. All resources, including the dataset, pre-trained models, and baseline methods (in the form of notebooks), are available on Kaggle, so one can start working with the dataset in two mouse clicks.

Paper Structure (35 sections, 4 equations, 6 figures, 6 tables)


Figures (6)

  • Figure 1: Our view on Species Distribution Models (SDM). The SDM utilizes multimodal predictors (e.g., satellite, climate, and environmental data) for given GPS coordinates to predict multi-species compositions at that location.
  • Figure 2: Geospatial scale of the dataset. While the provided Presence-Only (PO) data spans all of habitable Europe, the Presence-Absence (PA) training and test sites are primarily from France, Denmark, Switzerland, and Czechia.
  • Figure 3: Satellite image data. 128$\times$128 images from Sentinel-2. First row: RGB; second row: NIR.
  • Figure 4: Time-series data cube samples. Top row: 19 years of four monthly climate variables (minimum, maximum, and mean temperature, and precipitation). Bottom row: 21 years of quarterly satellite values (R, G, B, NIR, SWIR1, and SWIR2). Each column corresponds to one PA survey. The values correspond to the pixel at the observation coordinate.
  • Figure 5: Multimodal ensemble (MME) baseline. Each modality (e.g., satellite images, climatic cube, and Landsat cube) is processed through a lightweight 6-layer residual encoder (i.e., ResNet-6). The resulting embeddings are then concatenated and passed to a final classification layer followed by a sigmoid.
  • ...and 1 more figure
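
The fusion step described in the Figure 5 caption (encode each modality, concatenate the embeddings, apply a classification layer and a sigmoid) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' model: the learned ResNet-6 encoders are replaced by a placeholder that averages each channel, only the two time-series modalities are shown, and the weights, the toy species count, and the function names (`make_cube`, `encode`, `fuse_and_classify`) are hypothetical. The cube shapes follow the Figure 4 caption: 4 climate variables × 19 years × 12 months, and 6 Landsat bands × 21 years × 4 quarters.

```python
import math
import random


def make_cube(channels, years, steps, rng):
    """Build a random channels x years x steps cube as nested lists."""
    return [[[rng.random() for _ in range(steps)] for _ in range(years)]
            for _ in range(channels)]


def encode(cube):
    """Placeholder encoder: one mean value per channel.

    Stands in for a learned encoder; it is NOT a ResNet-6.
    """
    return [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
            for channel in cube]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse_and_classify(embeddings, weights, bias):
    """Concatenate embeddings, apply a linear layer, then a sigmoid per species."""
    fused = [v for emb in embeddings for v in emb]
    logits = [sum(w * v for w, v in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return [sigmoid(z) for z in logits]


rng = random.Random(0)
climate = make_cube(4, 19, 12, rng)  # monthly climate variables
landsat = make_cube(6, 21, 4, rng)   # quarterly Landsat bands
embeddings = [encode(climate), encode(landsat)]

num_species = 5  # toy value; the dataset covers more than 10k species
fused_dim = sum(len(e) for e in embeddings)  # 4 + 6 = 10
weights = [[rng.uniform(-1, 1) for _ in range(fused_dim)]
           for _ in range(num_species)]
bias = [0.0] * num_species

# One independent probability per species (multi-label output).
probs = fuse_and_classify(embeddings, weights, bias)
print(len(probs), all(0.0 < p < 1.0 for p in probs))
```

The sigmoid, rather than a softmax, gives each species an independent probability, which fits the multi-species composition that the SDM predicts for a location.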