Can Moran Eigenvectors Improve Machine Learning of Spatial Data? Insights from Synthetic Data Validation

Ziqi Li, Zhan Peng

TL;DR

This work evaluates whether incorporating Moran Eigenvectors into machine learning improves the handling of spatial effects in data. Using synthetic data generated on a regular grid and on US counties, the authors compare coordinates-only features against Moran Eigenvectors derived from Queen contiguity and exponential kernel weights, with eigenvector selection via LASSO (MSE/BIC). Four ML algorithms (Random Forest, XGBoost, LightGBM, TabNet) are benchmarked alongside ESF-SVC and linear baselines, with GeoShapley providing process-level explanations of spatial versus non-spatial contributions. The findings indicate that coordinates alone often outperform eigenvector-based inputs, especially when nonlinear ML models are used. Moran Eigenvectors may still be useful for network autocorrelation or negative spatial autocorrelation, and GeoShapley offers a valuable explainability tool for diagnosing spatial effects in ML models.

Abstract

Moran Eigenvector Spatial Filtering (ESF) approaches have shown promise in accounting for spatial effects in statistical models. Can this extend to machine learning? This paper examines the effectiveness of using Moran Eigenvectors as additional spatial features in machine learning models. We generate synthetic datasets with known processes involving spatially varying and nonlinear effects across two different geometries. Moran Eigenvectors calculated from different spatial weights matrices, with and without a priori eigenvector selection, are tested. We assess the performance of popular machine learning models, including Random Forests, LightGBM, XGBoost, and TabNet, and benchmark their accuracies in terms of cross-validated R² values against models that use only coordinates as features. We also extract coefficients and functions from the models using GeoShapley and compare them with the true processes. Results show that machine learning models using only location coordinates achieve better accuracies than eigenvector-based approaches across various experiments and datasets. We further note that, while these findings hold for spatial processes that exhibit positive spatial autocorrelation, they do not necessarily apply to network autocorrelation or to cases with negative spatial autocorrelation, where Moran Eigenvectors would still be useful.
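
To make the notion of a Moran Eigenvector concrete, the following is a minimal sketch of the standard ESF construction on a regular grid with Queen contiguity: the spatial weights matrix W is double-centered as MWM, with M = I - 11'/n, and then eigendecomposed. The grid size (10 x 10) and the helper names `queen_weights`, `moran_eigenvectors`, and `moran_i` are illustrative assumptions, not taken from the paper, and the authors' own implementation may differ (for example in how W is standardized or how eigenvectors are selected).

```python
import numpy as np

def queen_weights(nrows, ncols):
    """Binary Queen-contiguity matrix for a regular nrows x ncols grid (hypothetical helper)."""
    n = nrows * ncols
    W = np.zeros((n, n))
    for r in range(nrows):
        for c in range(ncols):
            i = r * ncols + c
            # Queen contiguity: cells sharing an edge or a corner are neighbors.
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (dr or dc) and 0 <= rr < nrows and 0 <= cc < ncols:
                        W[i, rr * ncols + cc] = 1.0
    return W

def moran_eigenvectors(W):
    """Eigen-decompose the double-centered weights matrix M W M, sorted by descending eigenvalue."""
    n = W.shape[0]
    M = np.eye(n) - np.ones((n, n)) / n   # centering projector
    C = M @ ((W + W.T) / 2.0) @ M         # symmetrize W before centering
    vals, vecs = np.linalg.eigh(C)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

def moran_i(x, W):
    """Moran's I of a variable x under weights W."""
    z = x - x.mean()
    return (len(x) / W.sum()) * (z @ W @ z) / (z @ z)

# Assumed illustrative setting: a 10 x 10 grid.
W = queen_weights(10, 10)
eigvals, eigvecs = moran_eigenvectors(W)

# Eigenvectors with positive eigenvalues describe positively autocorrelated map patterns;
# these are the candidates that would be appended to the feature matrix of an ML model.
positive = eigvecs[:, eigvals > 1e-8]
print(positive.shape[1], "eigenvectors with positive eigenvalues")
print("Moran's I of the leading eigenvector:", moran_i(positive[:, 0], W))
```

In this construction each eigenvector's Moran's I is proportional to its eigenvalue, which is why the leading eigenvectors capture broad, smooth spatial patterns and later ones capture progressively finer patterns. The eigenvectors with negative eigenvalues, discarded here, are the ones relevant to the negative spatial autocorrelation case mentioned in the abstract.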

Paper Structure

This paper contains 14 sections, 6 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Workflow
  • Figure 2: True data generating processes for Grid.
  • Figure 3: True data generating processes for US Counties.
  • Figure 4: Illustration of selected Moran Eigenvectors for Grid
  • Figure 5: Illustration of selected Moran Eigenvectors for US Counties
  • ...and 5 more figures