Foundation for unbiased cross-validation of spatio-temporal models for species distribution modeling
Diana Koldasbayeva, Alexey Zaytsev
TL;DR
This paper addresses the risk that spatial autocorrelation (SAC) and temporal autocorrelation inflate the evaluation metrics of species distribution models (SDMs) under conventional random cross-validation (CV). It systematically benchmarks four machine learning algorithms across multiple SAC-aware CV designs on two real-world presence-absence datasets, and compares two final training strategies, RETRAIN and LAST FOLD, with extensive hyperparameter tuning. Random CV can overestimate AUC by up to 0.16 and inflate MAE by ~75%, while spatial blocking aligned with the SAC range substantially reduces this bias; LAST FOLD offers conservative validation in high-SAC contexts but is not universally superior. The authors propose a practical, SAC-aware SDM workflow (estimate the SAC range, apply blocking, tune within blocked CV, and validate with external temporal data) and provide open-source tooling to improve transferability under spatial and temporal shifts.
Abstract
Evaluating the predictive performance of species distribution models (SDMs) under realistic deployment scenarios requires careful handling of spatial and temporal dependencies in the data. Cross-validation (CV) is the standard approach for model evaluation, but its design strongly influences the validity of performance estimates. When SDMs are intended for spatial or temporal transfer, random CV can lead to overoptimistic results due to spatial autocorrelation (SAC) among neighboring observations. We benchmark four machine learning algorithms (GBM, XGBoost, LightGBM, Random Forest) on two real-world presence-absence datasets, a temperate plant and an anadromous fish, using multiple CV designs: random, spatial, spatio-temporal, environmental, and forward-chaining. Two training data usage strategies (LAST FOLD and RETRAIN) are evaluated, with hyperparameter tuning performed within each CV scheme. Model performance is assessed on independent out-of-time test sets using AUC, MAE, and correlation metrics. Random CV overestimates AUC by up to 0.16 and produces MAE values up to 80 percent higher than spatially blocked alternatives. Blocking at the empirical SAC range substantially reduces this bias. Training strategy affects evaluation outcomes: LAST FOLD yields smaller validation-test discrepancies under strong SAC, while RETRAIN achieves higher test AUC when SAC is weaker. Boosted ensemble models consistently perform best under spatially structured CV designs. We recommend a robust SDM workflow based on SAC-aware blocking, blocked hyperparameter tuning, and external temporal validation to improve reliability under spatial and temporal shifts.
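To make the idea of blocking at the SAC range concrete, the following is a minimal sketch of grid-based spatial blocking, not the authors' implementation. It assumes projected coordinates in arbitrary units, square blocks whose side equals a hypothetical SAC range of 25 units, and round-robin assignment of whole blocks to folds; the function names `block_id` and `spatial_block_folds` are illustrative only. The coordinates are synthetic and stand in for survey locations.

```python
import math
import random
from collections import defaultdict


def block_id(x, y, block_size):
    """Return the square grid cell (block) that contains the point (x, y)."""
    return (math.floor(x / block_size), math.floor(y / block_size))


def spatial_block_folds(coords, block_size, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs in which every block is held out exactly once."""
    # Group observation indices by the grid block they fall into.
    blocks = defaultdict(list)
    for i, (x, y) in enumerate(coords):
        blocks[block_id(x, y, block_size)].append(i)

    # Shuffle the blocks (not the points), then assign whole blocks to folds.
    block_keys = sorted(blocks)
    random.Random(seed).shuffle(block_keys)
    fold_of_block = {key: k % n_folds for k, key in enumerate(block_keys)}

    for fold in range(n_folds):
        test_idx = sorted(
            i for key in block_keys if fold_of_block[key] == fold for i in blocks[key]
        )
        train_idx = sorted(
            i for key in block_keys if fold_of_block[key] != fold for i in blocks[key]
        )
        yield train_idx, test_idx


# Synthetic coordinates standing in for survey locations (assumption).
rng = random.Random(42)
coords = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(200)]

sac_range = 25.0  # hypothetical SAC range; in practice estimated from the data
folds = list(spatial_block_folds(coords, block_size=sac_range, n_folds=4))

for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

Because whole blocks are held out together, no block contributes points to both the training and the validation side of a fold, unlike random CV, where neighbouring points are routinely split across the two. Note that plain grid blocking does not separate points that sit close together on opposite sides of a block boundary, so this sketch reduces but does not eliminate leakage from SAC.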
