SpaCE: The Spatial Confounding Environment

Mauricio Tec; Ana Trisovic; Michelle Audirac; Sophie Woodward; Jie Kate Hu; Naeem Khoshnevis; Francesca Dominici

TL;DR

SpaCE tackles spatial confounding by offering a comprehensive benchmark ecosystem that combines real covariates and treatments with semi-synthetic outcomes and ground-truth counterfactuals. The approach uses a two-stage pipeline to generate SpaceEnvs and then SpaceDatasets, where counterfactuals are produced as $\tilde{Y}_s^a = f(X_s,a) + R_s$ with $f$ learned via AutoML and $R_s$ drawn from a Gaussian Markov Random Field to preserve spatial autocorrelation. Key contributions include a diverse set of DataCollections (ranging from thousands to millions of nodes), an automated end-to-end pipeline, spatially aware cross-validation, and standardized tasks/metrics (e.g., average treatment effect $\tau_{ate}$, exposure-response function $\tau_{erf}(a)$, and individualized treatment effects $\tilde{Y}_s^a$) for evaluating causal inference methods under spatial confounding. The work enables rigorous, comparable benchmarking across methods and datasets, with practical impact for developing robust spatial causal tools and guiding future methodological research in real-world spatial domains.
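The generation rule $\tilde{Y}_s^a = f(X_s,a) + R_s$ can be illustrated with a small NumPy sketch. Everything in it is an assumption made for illustration, not the SpaCE implementation: the 4-neighbor lattice stands in for a real spatial graph, the hand-written linear `f` stands in for the AutoML-learned predictor, and the GMRF uses a proper CAR precision $Q = \tau (D - \rho A)$, which may differ from the parameterization used in the paper. The ATE and exposure-response function are computed from the counterfactuals using their usual definitions as averages over locations.

```python
import numpy as np


def grid_adjacency(n_rows, n_cols):
    """Adjacency matrix of a 4-neighbor lattice (hypothetical stand-in graph)."""
    n = n_rows * n_cols
    A = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            if c + 1 < n_cols:          # right neighbor
                A[i, i + 1] = A[i + 1, i] = 1.0
            if r + 1 < n_rows:          # bottom neighbor
                A[i, i + n_cols] = A[i + n_cols, i] = 1.0
    return A


def sample_gmrf(A, rho=0.9, tau=1.0, rng=None):
    """Draw R ~ N(0, Q^{-1}) with assumed precision Q = tau * (D - rho * A)."""
    rng = np.random.default_rng() if rng is None else rng
    D = np.diag(A.sum(axis=1))
    Q = tau * (D - rho * A)             # positive definite for |rho| < 1
    L = np.linalg.cholesky(Q)           # Q = L @ L.T
    z = rng.standard_normal(A.shape[0])
    return np.linalg.solve(L.T, z)      # Cov = (L @ L.T)^{-1} = Q^{-1}


def f(X, a):
    """Hypothetical outcome model; SpaCE learns f with AutoML instead."""
    return X @ np.array([1.0, -0.5]) + 2.0 * a


rng = np.random.default_rng(0)
A = grid_adjacency(10, 10)
n = A.shape[0]
X = rng.standard_normal((n, 2))         # synthetic covariates for the sketch
R = sample_gmrf(A, rng=rng)             # spatially correlated residuals

# Counterfactuals Y[s, j] = f(X_s, a_j) + R_s for each treatment value a_j.
a_vals = np.array([0.0, 1.0, 2.0])
Y_cf = np.stack([f(X, a) + R for a in a_vals], axis=1)

# Exposure-response function: average counterfactual at each treatment value.
erf = Y_cf.mean(axis=0)
# Average treatment effect between a = 1 and a = 0.
ate = (Y_cf[:, 1] - Y_cf[:, 0]).mean()
```

Because the same residual $R_s$ enters every counterfactual at location $s$, it cancels in treatment contrasts, so the ATE in this sketch depends only on `f`.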

Abstract

Spatial confounding poses a significant challenge in scientific studies involving spatial data, where unobserved spatial variables can influence both treatment and outcome, possibly leading to spurious associations. To address this problem, we introduce SpaCE: The Spatial Confounding Environment, the first toolkit to provide realistic benchmark datasets and tools for systematically evaluating causal inference methods designed to alleviate spatial confounding. Each dataset includes training data, true counterfactuals, a spatial graph with coordinates, and smoothness and confounding scores characterizing the effect of a missing spatial confounder. The datasets also include realistic semi-synthetic outcomes and counterfactuals, generated using state-of-the-art machine learning ensembles, following best practices for causal inference benchmarks. They cover real treatments and covariates from diverse domains, including climate, health, and social sciences. SpaCE provides an automated end-to-end pipeline that simplifies data loading, experimental setup, and the evaluation of machine learning and causal inference models. The SpaCE project provides several dozen datasets of diverse sizes and spatial complexity. It is publicly available as a Python package, encouraging community feedback and contributions.

Paper Structure (19 sections, 2 equations, 10 figures, 9 tables, 1 algorithm)

Introduction
Background on Spatial Confounding
SpaCE: The Spatial Confounding Environment
Details about the Data Generation Pipeline
Examples and Experiments
Conclusion and Discussion
Data collections
SpaceEnv Generation Details
Training $f$ using AutoML
Sampling ${\bm{R}}$ using a Gaussian Markov Random Field
Code and API
Sharing and documentation of collections and environments
Distribution of spatial and confounding scores
SpaCE API: Accessing environments, making benchmark datasets, and evaluating them
Hyper-parameter Tuning and Additional details of benchmarks
...and 4 more sections

Figures (10)

Figure 1: SpaCE encapsulates essential components necessary for causal effect estimation algorithms using spatial data, including ground-truth counterfactuals.
Figure 2: Causal diagram of spatial confounding with neighbors $s$ and $s'$. Arrows represent causal relations; undirected dotted lines represent associations that are not necessarily causal. The correlations increase as the distance between $s$ and $s'$ decreases.
Figure 3: Example synthetic outcome and residuals from the synthetic data generation in the healthd_pollutn_mortality_cont environment.
Figure 4: (Top): The semi-synthetic data generation pipeline and acquisition. (Bottom): Summary of the key terms used in SpaCE.
Figure 5: Examples using the cdcsvi_nohsdp_poverty_cont environment.
...and 5 more figures
