Foundation for unbiased cross-validation of spatio-temporal models for species distribution modeling

Diana Koldasbayeva, Alexey Zaytsev

TL;DR

This paper addresses the risk that spatial and temporal autocorrelation inflates SDM evaluation metrics under conventional random cross-validation. It benchmarks four machine learning algorithms across random and SAC-aware CV designs on two real-world presence-absence datasets, comparing two final training strategies (RETRAIN and LAST FOLD) with hyperparameter tuning performed within each CV scheme. Random CV overestimates AUC by up to 0.16 and produces MAE values up to 80 percent higher than spatially blocked alternatives, while blocking at the empirical SAC range substantially reduces this bias; LAST FOLD offers conservative validation in high-SAC contexts but is not universally superior. The authors propose a practical, SAC-aware SDM workflow (estimate the SAC range, apply blocking, tune within blocked CV, and validate with external temporal data) and provide open-source tooling to improve transferability under spatial and temporal shifts.

Abstract

Evaluating the predictive performance of species distribution models (SDMs) under realistic deployment scenarios requires careful handling of spatial and temporal dependencies in the data. Cross-validation (CV) is the standard approach for model evaluation, but its design strongly influences the validity of performance estimates. When SDMs are intended for spatial or temporal transfer, random CV can lead to overoptimistic results due to spatial autocorrelation (SAC) among neighboring observations. We benchmark four machine learning algorithms (GBM, XGBoost, LightGBM, Random Forest) on two real-world presence-absence datasets, a temperate plant and an anadromous fish, using multiple CV designs: random, spatial, spatio-temporal, environmental, and forward-chaining. Two training data usage strategies (LAST FOLD and RETRAIN) are evaluated, with hyperparameter tuning performed within each CV scheme. Model performance is assessed on independent out-of-time test sets using AUC, MAE, and correlation metrics. Random CV overestimates AUC by up to 0.16 and produces MAE values up to 80 percent higher than spatially blocked alternatives. Blocking at the empirical SAC range substantially reduces this bias. Training strategy affects evaluation outcomes: LAST FOLD yields smaller validation-test discrepancies under strong SAC, while RETRAIN achieves higher test AUC when SAC is weaker. Boosted ensemble models consistently perform best under spatially structured CV designs. We recommend a robust SDM workflow based on SAC-aware blocking, blocked hyperparameter tuning, and external temporal validation to improve reliability under spatial and temporal shifts.
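
As a rough illustration of the SAC-aware blocking step recommended above, the following sketch assigns observations to square spatial blocks and then deals whole blocks into folds, so that nearby points never sit on both sides of a train/validation split. It is a minimal stand-in and not the authors' tooling: the function names, the square-grid blocking, and the round-robin fold assignment are assumptions made for illustration, and `block_size` is assumed to be set to an SAC range estimated separately (for example from a variogram).

```python
import math
from collections import defaultdict


def spatial_block_folds(coords, block_size, n_folds):
    """Assign each point to a CV fold via square spatial blocks.

    coords: list of (x, y) pairs in projected units (e.g. metres).
    block_size: block side length; assumed here to equal an SAC range
        estimated elsewhere (hypothetical input, not computed here).
    n_folds: number of folds to deal the blocks into.
    """
    blocks = defaultdict(list)
    for i, (x, y) in enumerate(coords):
        # Points falling in the same grid cell share a block key.
        key = (math.floor(x / block_size), math.floor(y / block_size))
        blocks[key].append(i)

    folds = [None] * len(coords)
    # Deal blocks to folds round-robin in sorted order (deterministic).
    for j, key in enumerate(sorted(blocks)):
        for i in blocks[key]:
            folds[i] = j % n_folds
    return folds


def blocked_splits(folds, n_folds):
    """Yield (train_indices, validation_indices) for each non-empty fold."""
    for k in range(n_folds):
        val = [i for i, f in enumerate(folds) if f == k]
        train = [i for i, f in enumerate(folds) if f != k]
        if val and train:
            yield train, val


# Toy coordinates, invented purely for illustration.
coords = [(0, 0), (1, 1), (12, 3), (13, 2), (25, 25)]
folds = spatial_block_folds(coords, block_size=10, n_folds=2)
splits = list(blocked_splits(folds, n_folds=2))
```

Hyperparameter tuning would then loop over `splits` rather than over random folds, so that the selection criterion is computed under the same spatial separation as the final validation.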

Paper Structure

This paper contains 49 sections, 11 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of our modeling and validation design. Past environmental and occurrence data are split into two time intervals: one for model training and validation, and another held out for final testing. Within the training interval, we apply several CV strategies (random, spatial, spatio-temporal, environmental blocking, and TimeSeriesSplit), each tested with a range of hyperparameter sets and spatial distances, and evaluate model performance using ROC AUC. The best-performing model is selected using either the RETRAIN or LAST FOLD strategy and then evaluated on the temporally independent hold-out data. To assess robustness, we compare AUC scores from CV and hold-out evaluation using mean absolute error (MAE), Pearson correlation ($r$), and Spearman correlation ($\rho$).
  • Figure : (a) Gentianella campestris
  • Figure : (a) Gentianella campestris
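
The robustness comparison described in the Figure 1 caption (MAE, Pearson $r$, and Spearman $\rho$ between CV AUC and hold-out AUC) can be sketched with the standard library alone. This is an illustrative implementation under stated assumptions, not the authors' code: the function names are hypothetical, each list entry is assumed to be one model configuration, and the AUC values at the bottom are made-up toy numbers rather than results from the paper.

```python
import math


def mean_absolute_error(cv_auc, test_auc):
    """Average absolute gap between CV and hold-out AUC."""
    return sum(abs(c - t) for c, t in zip(cv_auc, test_auc)) / len(cv_auc)


def pearson_r(a, b):
    """Pearson correlation; undefined if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def average_ranks(values):
    """1-based ranks, with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(a, b):
    """Spearman correlation as Pearson correlation of the ranks."""
    return pearson_r(average_ranks(a), average_ranks(b))


# Toy values for three hypothetical configurations (not from the paper).
cv_auc = [0.90, 0.85, 0.80]
test_auc = [0.80, 0.75, 0.70]
mae = mean_absolute_error(cv_auc, test_auc)
r = pearson_r(cv_auc, test_auc)
rho = spearman_rho(cv_auc, test_auc)
```

A large MAE alongside high correlations would indicate that a CV scheme ranks configurations sensibly but is systematically optimistic, which is why the three metrics are read together.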