
Aligning Validation with Deployment: Target-Weighted Cross-Validation for Spatial Prediction

Alexander Brenning, Thomas Suesse

Abstract

Cross-validation (CV) is commonly used to estimate predictive risk when independent test data are unavailable. Its validity depends on the assumption that validation tasks are sampled from the same distribution as prediction tasks encountered during deployment. In spatial prediction and other settings with structured data, this assumption is frequently violated, leading to biased estimates of deployment risk. We propose Target-Weighted CV (TWCV), an estimator of deployment risk that corrects for discrepancies between validation and deployment task distributions, thereby addressing (1) covariate shift and (2) task-difficulty shift. We characterize prediction tasks by descriptors such as covariates and spatial configuration. TWCV assigns weights to validation losses such that the weighted empirical distribution of validation tasks matches the corresponding distribution over a target domain. The weights are obtained via calibration weighting, yielding an importance-weighted estimator that targets deployment risk. Since TWCV requires adequate coverage of the deployment distribution's support, we combine it with spatially buffered resampling that diversifies the task-difficulty distribution. In a simulation study, conventional as well as spatial estimators exhibit substantial bias depending on the sampling design, whereas buffered TWCV remains approximately unbiased across scenarios. A case study in environmental pollution mapping further confirms that discrepancies between validation and deployment task distributions can affect performance assessment, and that buffered TWCV better reflects the prediction task over the target domain. These results establish task distribution mismatch as a primary source of CV bias in spatial prediction and show that calibration weighting combined with a suitable validation task generator provides a viable approach to estimating predictive risk under dataset shift.
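To illustrate the core idea of weighting validation losses so that validation tasks match the deployment task distribution, the following toy sketch reweights a one-dimensional task descriptor (e.g. prediction distance) via exponential tilting so that its weighted mean matches a deployment target. This is a simplified stand-in for the paper's calibration weighting, with invented data and a single moment constraint; the descriptor `z_val`, the loss model, and the solver are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def calibration_weights(z_val, z_target_mean, tol=1e-10):
    """Exponential-tilting weights on validation tasks such that the
    weighted mean of descriptor z matches the deployment mean.
    Solved by bisection on the tilt parameter t (weighted mean is
    monotone increasing in t)."""
    z_centered = z_val - z_val.mean()  # centering for numerical stability

    def tilted(t):
        w = np.exp(t * z_centered)
        w /= w.sum()
        return w, float((w * z_val).sum())

    lo, hi = -50.0, 50.0
    for _ in range(200):
        t = 0.5 * (lo + hi)
        w, m = tilted(t)
        if abs(m - z_target_mean) < tol:
            break
        if m < z_target_mean:
            lo = t
        else:
            hi = t
    return w

rng = np.random.default_rng(0)
# Validation tasks over-represent easy, short-distance tasks ...
z_val = rng.exponential(0.5, size=2000)                    # task descriptor
loss_val = 1.0 + 2.0 * z_val + rng.normal(0, 0.1, 2000)    # loss grows with z
# ... while deployment tasks have a larger typical descriptor value.
z_dep_mean = 1.0

w = calibration_weights(z_val, z_dep_mean)
naive_cv = loss_val.mean()        # conventional unweighted CV estimate
twcv = (w * loss_val).sum()       # target-weighted estimate
```

Because the loss increases with the descriptor and deployment tasks are harder on average, the weighted estimate exceeds the naive CV mean, mimicking the bias correction described in the abstract. A full implementation would calibrate several descriptors jointly (covariates and spatial configuration) and rely on adequate support overlap, which is why the paper pairs the weighting with buffered resampling.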


Paper Structure

This paper contains 37 sections, 16 equations, and 15 figures.

Figures (15)

  • Figure 1: Illustration of the three sampling designs and their effects on spatial prediction. Columns correspond to random, clustered, and preferentially biased sampling for one representative simulation setting (strong trend, $\rho=0.1$). The top row shows the simulated response field on $D=[0,1]^2$ together with sampled locations. The second and third rows show the corresponding prediction maps obtained from random forest and heteroskedastic regression--kriging, respectively.
  • Figure 2: Predicted annual mean NO$_2$ concentrations across Germany obtained with random forest (left) and regression--kriging models (right), along with the locations of 503 monitoring stations. Both models use the same set of topographic and demographic covariates.
  • Figure 3: Joint distribution of prediction tasks in $(x_1,d)$ space in the simulation study. Panels compare the deployment task distribution with validation tasks generated by LOOCV, random $10$-fold CV, LOBOCV, and buffered LOOCV, separately for the three sampling designs. Here, $x_1$ denotes the large-scale environmental gradient and $d$ the nearest-neighbour prediction distance to the corresponding training set.
  • Figure 4: Mean error of RMSE estimators in the simulation study for different validation schemes and model-based uncertainty estimators for one representative simulation scenario (strong trend, $\rho=0.1$). Points show the mean difference between the estimated RMSE and the true deployment RMSE, with approximate $95\%$ confidence intervals. The dashed line indicates unbiased estimation of deployment RMSE. Positive values indicate overestimation of prediction error. Refer to the Supplementary Materials for results of other simulation scenarios and estimators.
  • Figure 5: Empirical distribution of prediction tasks in the space spanned by population density and prediction distance $d$ for the 2018 NO$_2$ case study in Germany. Panels compare the deployment tasks on the 2 km target grid with validation tasks induced by LOOCV, random CV, spatial CV, kNNDM, and buffered LOOCV. Local population density is shown on the $\log(1+x)$ scale.
  • ...and 10 more figures