ML4EJ: Decoding the Role of Urban Features in Shaping Environmental Injustice Using Interpretable Machine Learning

Yu-Hsuan Ho; Zhewei Liu; Cheng-Chun Lee; Ali Mostafavi

TL;DR

This work investigates how heterogeneous urban features shape inequalities in environmental hazard exposure for three hazards (urban heat, flood, and PM$_{2.5}$) across six U.S. counties using interpretable tree-based models. Applying Random Forest and XGBoost with a 70/30 train/test split and ten-fold cross-validation, the study quantifies the extent to which urban features drive hazard disparities (via the F-score) and identifies the top contributing features through a normalized Gini-based importance ranking, aggregated into an overall importance score $I^{Overall}$. Key findings show that social-demographic factors largely drive disparities, that urban heat is the most predictable hazard, and that feature importance follows county-specific patterns; the results also reveal limited cross-county transferability, underscoring the need for localized urban design policies. The work further demonstrates potential co-benefits across hazards when targeting urban heat through regionally aware interventions, and discusses three causal interpretations of urban features in relation to hazards. Overall, the study provides data-driven, interpretable insights to inform integrated urban design and environmental justice policy, while highlighting data and transferability limitations that motivate broader, multi-region validation.
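The data partitioning mentioned above can be illustrated with a minimal sketch. It assumes the ten folds are drawn from the 70% training portion and that the split is a plain random shuffle; the summary does not specify either detail, and the function names (`split_70_30`, `ten_folds`, `cross_validation_splits`) are hypothetical, not taken from the paper.

```python
import random

def split_70_30(n_samples, seed=0):
    """Shuffle sample indices and hold out 30% as a test set (assumed random split)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(0.7 * n_samples)
    return indices[:n_train], indices[n_train:]

def ten_folds(train_indices, k=10):
    """Partition the training indices into k near-equal, disjoint folds."""
    return [train_indices[i::k] for i in range(k)]

def cross_validation_splits(train_indices, k=10):
    """Yield (fit_indices, validation_indices) pairs, one per fold."""
    folds = ten_folds(train_indices, k)
    for i, validation in enumerate(folds):
        # Fit on every fold except the one held out for validation.
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation

# Example with 1000 hypothetical census tracts.
train_idx, test_idx = split_70_30(1000)
splits = list(cross_validation_splits(train_idx))
```

Each `(fit, validation)` pair would be used to train and score one model during cross-validation, while `test_idx` stays untouched until the final evaluation.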

Abstract

Understanding the key factors shaping environmental hazard exposures and their associated environmental injustice issues is vital for formulating equitable policy measures. Traditional perspectives on environmental injustice have primarily focused on socioeconomic dimensions, often overlooking the influence of heterogeneous urban characteristics. This limited view may obstruct a comprehensive understanding of the complex nature of environmental justice and its relationship with urban design features. To address this gap, this study creates an interpretable machine learning model to examine the effects of various urban features and their non-linear interactions on the exposure disparities of three primary hazards: air pollution, urban heat, and flooding. The analysis trains and tests models with data from six metropolitan counties in the United States using Random Forest and XGBoost. Model performance is used to measure the extent to which variations in urban features shape disparities in environmental hazard levels. In addition, the feature importance analysis identifies features related to social-demographic characteristics as the most prominent urban features shaping hazard extent. Features related to infrastructure distribution and land cover are relatively important for urban heat and air pollution exposure, respectively. Moreover, we evaluate the models' transferability across different regions and hazards. The results highlight limited transferability, underscoring the intricate differences among hazards and regions and in the way urban features shape hazard exposures. The insights gleaned from this study offer fresh perspectives on the relationships among urban features and their interplay with environmental hazard exposure disparities, informing the development of more integrated urban design policies to enhance social equity and address environmental injustice issues.

Paper Structure (17 sections, 8 equations, 8 figures, 3 tables)

Introduction
Data and Methodology
Study Area and Data Description
Model Training
Feature Importance Analysis
Results
Model Predictability
Feature Importance Analysis
Model Transferability
Discussion
Predictability Analysis of Hazard Exposures and Inequality
Causal Relationships Interpretations Between Urban Features and Environmental Hazards
Assessing Model Transferability Across Counties for Informed Urban Development Strategies
Closing Remarks
Data Availability
...and 2 more sections

Figures (8)

Figure 1: Study overview: The urban features constructed from datasets related to social-demographics, built environment, human mobility, and land cover are used as input features. Interpretable machine learning models are adopted for predicting the extent of spatial hazard exposure. Feature importance analysis is based on interpretations of the trained machine learning models (RQ1). By applying the model trained for one hazard or county to another, the transferability of models is evaluated to answer RQ2 and RQ3.
Figure 2: Spatial distribution of the exposure of environmental hazards. (a)-(f) Distributions of urban heat risk level in Fulton County, Harris County, Cook County, Wayne County, Suffolk County, and Queens County. (g)-(l) Distributions of flood risk levels in respective counties. (m)-(p) Distributions of the concentration of PM$_{2.5}$. The urban heat risk levels and flood risk levels are the average value for all households within each census tract.
Figure 3: Performance of the random forest classifier and the XGBoost classifier. (a) Average F-score of the random forest classifier and of the XGBoost classifier for urban heat risk level prediction. (b) Average F-score for flood risk level prediction. (c) Average F-score for air pollution (PM$_{2.5}$ concentration) prediction. Panels (a)-(b) show average F-scores across six counties, whereas panel (c) shows average F-scores across four counties because data are unavailable for Fulton County and Queens County. The $\beta$ for the F-score is set to 1.5 for the evaluation of all models. The F-score is calculated on the test set. Both models yield the highest F-score for urban heat, followed by flood. In addition, random forest yields better prediction performance than XGBoost for all three environmental hazards.
Figure 4: Distribution of the inter-county and inter-hazard standard deviations ($\sigma^c$ and $\sigma^h$) of the F-score, representing the extent of the performance difference across counties for the same hazard and across hazards for the same county, respectively. The average inter-county standard deviation ($\overline{\sigma^c}$) is the mean of the inter-county standard deviations ($\sigma^c$) over all hazards. The average inter-hazard standard deviation ($\overline{\sigma^h}$) is the mean of the inter-hazard standard deviations ($\sigma^h$) over all counties.
Figure 5: Feature importance and the distribution of important features for each environmental hazard and each county. (a)-(f) Top seven important features for urban heat risk in each county and the proportions of different feature domains in the top seven features. (g)-(l) Top seven important features for flood risk and the proportions of different feature domains in the top seven features. (m)-(p) Top seven important features for air pollution (PM$_{2.5}$ concentration) and the proportions of different feature domains in the top seven features. (q) Proportions of different feature domains in the overall top seven important features for each environmental hazard across all counties. The features listed in (a)-(p) are the respective important features in each county; each feature is labeled as to whether it is among the top seven important features in the overall ranking across all counties.
...and 3 more figures
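
The evaluation quantities named in the captions of Figures 3 and 4 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it shows the binary form of the F-score with $\beta = 1.5$ (the multi-class averaging used in the paper is not given here), it uses the population standard deviation for $\sigma^c$ (the captions do not say whether the population or sample form is used), and the county names and confusion counts are made up for the example.

```python
from statistics import pstdev

def f_beta(tp, fp, fn, beta=1.5):
    """Binary F-beta score from true-positive, false-positive, and false-negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    # With beta > 1, recall is weighted more heavily than precision.
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical per-county confusion counts (tp, fp, fn) for a single hazard.
hypothetical_counts = {
    "County A": (80, 20, 10),
    "County B": (60, 15, 30),
    "County C": (70, 30, 20),
}

# One F-score per county for this hazard.
scores = {county: f_beta(*counts) for county, counts in hypothetical_counts.items()}

# Inter-county standard deviation for this hazard (population form, an assumption).
sigma_c = pstdev(list(scores.values()))
```

Repeating the last step for every hazard and averaging the results would give $\overline{\sigma^c}$; swapping the roles of counties and hazards gives $\sigma^h$ and $\overline{\sigma^h}$.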
