How Your Location Relates to Health: Variable Importance and Interpretable Machine Learning for Environmental and Sociodemographic Data

Ishaan Maitra, Raymond Lin, Eric Chen, Jon Donnelly, Sanja Šćepanović, Cynthia Rudin

TL;DR

This paper analyzes how location modulates health determinants, using the MEDSAT dataset and a six-step interpretable ML pipeline that combines knockoff-based variable filtering, multi-metric importance ranking, and spatially aware models. It pairs global Generalized Additive Models (GAMs) with spatially adaptive Multiscale Geographically Weighted Regression (MGWR), followed by region-specific local GAMs, to separate universal predictors from region-dependent ones. Key findings include NO2 as a global predictor for asthma, hypertension, and anxiety, along with outcome-specific roles for occupation, residence duration, vegetation, and marital status; the associations for PM2.5 and solar radiation vary markedly by region and shifted during COVID. The study shows how an interpretable ML approach can uncover actionable health disparities and guide policy, and it highlights the need for local interventions (e.g., NO2 reduction in urban centers) and for collaboration with public health experts.

Abstract

Health outcomes depend on complex environmental and sociodemographic factors whose effects change over location and time. Only recently has fine-grained spatial and temporal data become available to study these effects, namely the MEDSAT dataset of English health, environmental, and sociodemographic information. Leveraging this new resource, we use a variety of variable importance techniques to robustly identify the most informative predictors across multiple health outcomes. We then develop an interpretable machine learning framework based on Generalized Additive Models (GAMs) and Multiscale Geographically Weighted Regression (MGWR) to analyze how each health outcome depends on each variable, both globally and locally across space. Our findings identify NO2 as a global predictor for asthma, hypertension, and anxiety, alongside other outcome-specific predictors related to occupation, marriage, and vegetation. Regional analyses reveal local variation in the associations with air pollution and solar radiation, with notable shifts during COVID. This comprehensive approach provides actionable insights for addressing health disparities and supports the integration of interpretable machine learning in public health.
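
To make the idea of spatially varying coefficients concrete, the following is a minimal sketch of basic geographically weighted regression with a single shared Gaussian bandwidth. It is not the paper's pipeline and not MGWR itself, which selects a separate bandwidth for each covariate. The function name `gwr_coefficients`, the bandwidth value, and the synthetic data are all assumptions made for illustration; no MEDSAT data is used.

```python
import numpy as np

def gwr_coefficients(coords, X, y, bandwidth):
    """Fit one weighted least-squares regression per location (basic GWR).

    coords    : (n, 2) array of point coordinates
    X         : (n, p) array of predictors, without an intercept column
    y         : (n,) array of outcomes
    bandwidth : Gaussian kernel bandwidth, in coordinate units (hypothetical,
                shared by all covariates; MGWR would fit one per covariate)
    Returns an (n, p + 1) array of local coefficients, intercept first.
    """
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])
    betas = np.empty((n, design.shape[1]))
    for i in range(n):
        # Nearby observations receive larger weights than distant ones.
        dist = np.linalg.norm(coords - coords[i], axis=1)
        w = np.exp(-0.5 * (dist / bandwidth) ** 2)
        sw = np.sqrt(w)
        # Weighted least squares, solved by scaling rows with sqrt(weights).
        beta, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
        betas[i] = beta
    return betas

# Synthetic example (illustrative only): the true slope grows with the
# first coordinate, so a single global slope would miss the pattern.
rng = np.random.default_rng(0)
n = 200
coords = rng.uniform(0.0, 1.0, size=(n, 2))
x = rng.normal(size=n)
true_slope = 1.0 + 2.0 * coords[:, 0]
y = true_slope * x + rng.normal(scale=0.1, size=n)

betas = gwr_coefficients(coords, x[:, None], y, bandwidth=0.15)
local_slopes = betas[:, 1]  # one estimated slope per location
```

Mapping `local_slopes` over `coords` gives the kind of coefficient surface that the MGWR figures in the paper display. With a very large bandwidth every location gets nearly equal weights, and the local fits collapse to a single global regression.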

Paper Structure

This paper contains 16 sections, 3 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Overview of the full methodology pipeline.
  • Figure 2: On the left, NO2 concentrations. On the right, asthma concentrations, 2019. Red regions have higher values.
  • Figure 3: Shape functions with 95% CIs for NO2 with asthma. Top row: Westminster (left) and Lambeth (right). Bottom row: Tower Hamlets (left) and Camden (right).
  • Figure 4: MGWR Coefficients for NO2 with Asthma (left), % Workers in Skilled Trades with Diabetes (right), 2019.
  • Figure 5: MGWR Coefficients for PM2.5 with Diabetes in 2019 (left) and 2020 (right).
  • ...and 2 more figures