The CAST package for training and assessment of spatial prediction models in R

Hanna Meyer, Marvin Ludwig, Carles Milà, Jan Linnenbrink, Fabian Schumacher

TL;DR

The paper addresses the challenge of producing spatially explicit environmental maps when training data are spatially clustered and non-i.i.d. It introduces the CAST package, which implements prediction-oriented cross-validation (NNDM and kNNDM), area-of-applicability (AOA) assessment with a dissimilarity index and local data density, and pixel-level performance estimation via error profiles to quantify uncertainty. Through a case study of plant species richness in South America, the paper demonstrates how CAST can be used to perform robust model tuning, avoid overfitting, and restrict predictions to domains where the learned relationships are valid. The work provides a practical, R-based toolkit that improves the reliability and interpretability of spatial predictions and their uncertainties, with plans to broaden compatibility with mlr3 and tidymodels.

Abstract

One key task in environmental science is to map environmental variables continuously in space or even in space and time. Machine learning algorithms are frequently used to learn from local field observations and to make spatial predictions by estimating the value of the variable of interest in places where it has not been measured. However, applying machine learning strategies to spatial mapping involves additional challenges compared to "non-spatial" prediction tasks. These challenges often originate from spatial autocorrelation and from training data that are not independent and identically distributed. In the past few years, we have developed a number of methods to support the application of machine learning to spatial data, including suitable cross-validation strategies for performance assessment and model selection, spatial feature selection, and methods to assess the area of applicability of trained models. The intention of the CAST package is to support the application of machine learning strategies for predictive mapping by implementing such methods and making them available for easy integration into modelling workflows. Here we introduce the CAST package and its core functionalities. Using a case study of mapping plant species richness, we go through the different steps of the modelling workflow and show how CAST can be used to support more reliable spatial predictions.

Paper Structure (9 sections, 9 figures, 1 table)

Figures (9)

  • Figure 1: A very simple workflow for spatial predictive mapping, indicating which CAST functions can be used in the different steps to support the spatial prediction.
  • Figure 2: Location of the reference data from sPlotOpen and example predictor variables (an excerpt of predictors_raster) for the desired prediction domain.
  • Figure 3: A first prediction of plant species richness in South America.
  • Figure 4: Nearest neighbor distance distribution represented as a density plot (left) and an empirical cumulative distribution function plot (right). Both show that prediction requires applying the model far beyond the clustered reference data. The aim of the cross-validation strategies implemented in CAST is to produce similar prediction-to-sample distances based on the available training data.
  • Figure 5: Comparison of cross-validation methods: random folds and their corresponding nearest neighbor distance distribution (top) as well as kNNDM folds and their nearest neighbor distance distribution (bottom).
  • ...and 4 more figures
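
The nearest neighbor distance comparison described in the Figure 4 caption can be illustrated with a small sketch. This is not CAST code and does not reproduce the package's implementation (CAST is an R package). It is a plain-Python illustration under stated assumptions: the clustered sample locations, the regular prediction grid, and all function names below are hypothetical and synthetic. It only shows why, for clustered reference data, distances from prediction locations to the nearest sample tend to be much larger than distances between neighboring samples.

```python
import math
import random
import statistics


def nearest_distance(point, others):
    """Euclidean distance from `point` to its closest neighbour in `others`."""
    return min(math.dist(point, other) for other in others)


def sample_to_sample(samples):
    """Nearest neighbour distance of each sample to all remaining samples."""
    return [
        nearest_distance(p, samples[:i] + samples[i + 1:])
        for i, p in enumerate(samples)
    ]


def prediction_to_sample(pred_points, samples):
    """Nearest neighbour distance of each prediction location to the samples."""
    return [nearest_distance(q, samples) for q in pred_points]


rng = random.Random(42)

# Hypothetical clustered reference data: three tight clusters in a unit square.
centres = [(0.2, 0.2), (0.25, 0.8), (0.8, 0.3)]
samples = [
    (cx + rng.gauss(0, 0.03), cy + rng.gauss(0, 0.03))
    for cx, cy in centres
    for _ in range(30)
]

# Hypothetical prediction domain: a regular 20 x 20 grid over the unit square.
grid = [(x / 19, y / 19) for x in range(20) for y in range(20)]

s2s = sample_to_sample(samples)
p2s = prediction_to_sample(grid, samples)

median_s2s = statistics.median(s2s)
median_p2s = statistics.median(p2s)

print(f"median sample-to-sample distance:     {median_s2s:.3f}")
print(f"median prediction-to-sample distance: {median_p2s:.3f}")
```

In this synthetic setting the two distance distributions differ strongly, which is the mismatch that a random cross-validation split would ignore and that prediction-oriented fold assignment aims to reduce.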