SpaCE: The Spatial Confounding Environment

Mauricio Tec, Ana Trisovic, Michelle Audirac, Sophie Woodward, Jie Kate Hu, Naeem Khoshnevis, Francesca Dominici

TL;DR

SpaCE tackles spatial confounding by offering a comprehensive benchmark ecosystem that combines real covariates and treatments with semi-synthetic outcomes and ground-truth counterfactuals. The approach uses a two-stage pipeline to generate SpaceEnvs and then SpaceDatasets, where counterfactuals are produced as $\tilde{Y}_s^a = f(X_s,a) + R_s$ with $f$ learned via AutoML and $R_s$ drawn from a Gaussian Markov Random Field to preserve spatial autocorrelation. Key contributions include a diverse set of DataCollections (ranging from thousands to millions of nodes), an automated end-to-end pipeline, spatially aware cross-validation, and standardized tasks/metrics (e.g., average treatment effect $\tau_{ate}$, exposure-response function $\tau_{erf}(a)$, and individualized treatment effects $\tilde{Y}_s^a$) for evaluating causal inference methods under spatial confounding. The work enables rigorous, comparable benchmarking across methods and datasets, with practical impact for developing robust spatial causal tools and guiding future methodological research in real-world spatial domains.
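
The generation rule $\tilde{Y}_s^a = f(X_s,a) + R_s$ can be illustrated with a small, self-contained sketch. This is not the SpaCE pipeline: the lattice graph, the simulated covariates and outcome, the least-squares fit standing in for the AutoML model $f$, and the conditional autoregressive precision matrix $Q = D - \rho W$ used for the Gaussian Markov Random Field are all assumptions made for illustration. The contrast between two treatment levels at the end is likewise a toy stand-in for $\tau_{ate}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a 10x10 lattice of spatial units (not a SpaCE dataset).
side = 10
n = side * side

# Adjacency matrix W of the 4-neighbor grid graph.
W = np.zeros((n, n))
for i in range(side):
    for j in range(side):
        s = i * side + j
        if i + 1 < side:
            W[s, s + side] = W[s + side, s] = 1.0
        if j + 1 < side:
            W[s, s + 1] = W[s + 1, s] = 1.0

# Assumed GMRF precision: Q = D - rho * W, positive definite for |rho| < 1.
rho = 0.95
Q = np.diag(W.sum(axis=1)) - rho * W


def sample_gmrf(Q, rng):
    """Draw R ~ N(0, Q^{-1}) using the Cholesky factor of the precision."""
    L = np.linalg.cholesky(Q)            # Q = L @ L.T
    z = rng.standard_normal(Q.shape[0])
    return np.linalg.solve(L.T, z)       # Cov(R) = (L @ L.T)^{-1} = Q^{-1}


# Simulated covariates, continuous treatment, and observed outcome.
X = rng.standard_normal((n, 2))
A = 0.5 * X[:, 0] + rng.standard_normal(n)
Y_obs = (1.0 + X @ np.array([0.8, -0.3]) + 0.6 * A - 0.2 * A**2
         + rng.standard_normal(n))


def features(X, a):
    """Design matrix [1, X, a, a^2]; `a` may be a scalar or a length-n vector."""
    a = np.broadcast_to(a, (X.shape[0],))
    return np.column_stack([np.ones(X.shape[0]), X, a, a**2])


# Stand-in for the AutoML model f: ordinary least squares on simple features.
beta, *_ = np.linalg.lstsq(features(X, A), Y_obs, rcond=None)


def f(X, a):
    return features(X, a) @ beta


# One spatially correlated residual field, shared across all treatment levels.
R = sample_gmrf(Q, rng)

# Counterfactuals for every unit s at every treatment level a in the grid.
a_grid = np.linspace(-1.0, 1.0, 5)       # [-1, -0.5, 0, 0.5, 1]
Y_cf = np.stack([f(X, a) + R for a in a_grid], axis=1)   # shape (n, 5)

# Exposure-response function: average counterfactual at each treatment level.
erf = Y_cf.mean(axis=0)

# Average contrast between a = 1 and a = 0 (toy analogue of an average effect).
ate = (Y_cf[:, 4] - Y_cf[:, 2]).mean()
```

Because the same residual $R_s$ is added at every treatment level, it cancels in unit-level contrasts, so the ground-truth effects in this sketch depend only on $f$, while the outcomes themselves remain spatially autocorrelated.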

Abstract

Spatial confounding poses a significant challenge in scientific studies involving spatial data, where unobserved spatial variables can influence both treatment and outcome, possibly leading to spurious associations. To address this problem, we introduce SpaCE: The Spatial Confounding Environment, the first toolkit to provide realistic benchmark datasets and tools for systematically evaluating causal inference methods designed to alleviate spatial confounding. Each dataset includes training data, true counterfactuals, a spatial graph with coordinates, and smoothness and confounding scores characterizing the effect of a missing spatial confounder. It also includes realistic semi-synthetic outcomes and counterfactuals, generated using state-of-the-art machine learning ensembles, following best practices for causal inference benchmarks. The datasets cover real treatments and covariates from diverse domains, including climate, health, and social sciences. SpaCE provides an automated end-to-end pipeline that simplifies data loading, experimental setup, and the evaluation of machine learning and causal inference models. The SpaCE project provides several dozen datasets of diverse sizes and spatial complexity. It is publicly available as a Python package, encouraging community feedback and contributions.

Paper Structure (19 sections, 2 equations, 10 figures, 9 tables, 1 algorithm)

Figures (10)

  • Figure 1: SpaCE encapsulates essential components necessary for causal effect estimation algorithms using spatial data, including ground-truth counterfactuals.
  • Figure 2: Causal diagram of spatial confounding with neighbors $s$ and $s'$. Arrows represent causal relations; undirected dotted lines represent associations that are not necessarily causal. The correlations increase as the distance between $s$ and $s'$ decreases.
  • Figure 3: Example synthetic outcome and residuals from the synthetic data generation in the healthd_pollutn_mortality_cont environment.
  • Figure 4: (Top): The semi-synthetic data generation pipeline and acquisition. (Bottom): Summary of the key terms used in SpaCE.
  • Figure 5: Examples using the cdcsvi_nohsdp_poverty_cont environment.
  • ...and 5 more figures