Unveiling Population Heterogeneity in Health Risks Posed by Environmental Hazards Using Regression-Guided Neural Network

Jong Woo Nam; Eun Young Choi; Jennifer A. Ailshire; Yao-Yi Chiang

TL;DR

Regression-Guided Neural Networks (ReGNN) are a hybrid method that embeds the flexibility of artificial neural networks (ANNs) within the structural form of a regression model, uncovering patterns of heterogeneity that would otherwise remain hidden.

Abstract

Environmental hazards place certain individuals at disproportionately high risk. As these hazards increasingly endanger human health, precise identification of the most vulnerable population subgroups is critical for public health. Moderated multiple regression (MMR) offers a straightforward way to investigate this: it adds interaction terms between the exposure to a hazard and other population characteristics to a linear regression model. However, when vulnerabilities are hidden within a cross-section of many characteristics, MMR often fails to yield meaningful findings. Here, we introduce a hybrid method, named regression-guided neural networks (ReGNN), which uses artificial neural networks (ANNs) to non-linearly combine predictors, generating a latent representation that interacts with a focal predictor (i.e., the variable measuring exposure to an environmental hazard). We showcase the use of ReGNN for investigating population heterogeneity in the effects of exposure to air pollution (PM2.5) on cognitive functioning scores. By comparing ReGNN's results with the fits of traditional MMR models, we demonstrate that ReGNN finds population heterogeneity that traditional MMR leaves hidden. In essence, ReGNN is a novel tool that enhances traditional regression models by effectively summarizing and quantifying an individual's susceptibility to health risks.
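The model form described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the one-hidden-layer network, the `tanh` activation, and every name (`mlp_index`, `regnn_predict`, `c_focal`, `c_int`, and so on) are assumptions introduced here, and the inputs are random numbers rather than the paper's data.

```python
import numpy as np

def mlp_index(M, W1, b1, W2, b2):
    """Map moderators M of shape (n, p) to one scalar index per row, shape (n,)."""
    hidden = np.tanh(M @ W1 + b1)   # non-linear combination of the moderators
    return (hidden @ W2 + b2).ravel()

def regnn_predict(x, controls, M, params):
    """Regression-shaped prediction: main effects plus a focal-by-index interaction."""
    index = mlp_index(M, params["W1"], params["b1"], params["W2"], params["b2"])
    return (params["c0"]
            + params["c_focal"] * x              # main effect of the focal predictor
            + controls @ params["c_controls"]    # main effects of control variables
            + params["c_int"] * x * index)       # interaction with the latent index

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
n, p, q, h = 5, 3, 2, 4
params = {
    "W1": rng.normal(size=(p, h)), "b1": np.zeros(h),
    "W2": rng.normal(size=(h, 1)), "b2": np.zeros(1),
    "c0": 0.5, "c_focal": -0.2,
    "c_controls": rng.normal(size=q), "c_int": 0.3,
}
x = rng.normal(size=n)               # focal predictor (e.g. an exposure measure)
controls = rng.normal(size=(n, q))
M = rng.normal(size=(n, p))          # moderators summarized by the network
y_hat = regnn_predict(x, controls, M, params)
```

Under this sketch, the effect of a one-unit increase in the focal predictor for a given individual is `c_focal + c_int * index`, so the scalar index alone carries the heterogeneity in the exposure effect.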

Paper Structure (15 sections, 4 equations, 7 figures)

Introduction
Related Works
Methodology
Advantages of using ReGNN over traditional regression or machine learning models
Experiment
Dataset
Health outcome: Cognitive function
Focal predictor: Air pollution (PM2.5)
Other input variables
Experimental setup
Results
Subtle changes that occur after the losses plateau drive the representation learning in ReGNN
Explainable AI techniques help understand how each predictor contributes to the generation of the summary index
Discussion and Future Works
Acknowledgement

Figures (7)

Figure 1: Overview of how a Regression-Guided Neural Network (ReGNN) is trained and analyzed. First, a neural network is embedded within an MMR equation and trained to summarize the moderators (M). Then, a twin regression model with the same equation used to train the network is fitted to compute its regression coefficients and their statistical significance. If a meaningful interaction is found, the trained neural network is examined using explainable AI tools such as partial dependence.
Figure 2: Regression coefficients comparing MMR models with (left) all predictors included as moderators ($r^2$ = 0.3135) and (right) only the output of the trained neural network, which we name the resilience index, included as a moderator ($r^2$ = 0.3148). The position along the x-axis shows each coefficient's value, and the error bar shows its confidence interval. Significance levels are indicated with asterisks ($^{***}$: p $<$ 0.001; $^{**}$: p $<$ 0.01; $^{*}$: p $<$ 0.05).
Figure 3: Predicted cognitive functioning scores based on the MMR fitted with the ReGNN-produced index (resilience index) as the moderating variable. Holding other independent variables at their means, the fitted MMR is used to predict the means and errors of the cognitive scores for differing levels of PM2.5. Groups with low (bottom 10th percentile), median, and high (top 10th percentile) resilience index are plotted separately to show the moderating effect of the index on the effect of PM2.5 on the predicted cognitive score.
Figure 4: Trajectories of model performance metrics during the training session. (Left) Train and test losses; (Right) the p-values of the interaction term's coefficient (blue) and the adjusted R-squared (red) of the twin-MMR model, fitted to the train set (solid line) and test set (dotted line), respectively. While the losses plateau after 20 epochs, the p-values decrease significantly afterward.
Figure 5: Trajectories of log-magnitudes of ReGNN's regression coefficients. Blue shows the L2 norm of all regression coefficients ($c_{k}$, $c_{n}$, and $c^{int}$), orange shows the magnitude of the coefficient for the interaction term ($c^{int}$), and green shows the ratio of the two. While the L2 norm stops decaying early on, $c^{int}$ decays until it reaches a minimum and then rebounds. The ratio shows that the overall magnitude (denominator) stays almost the same after 20 epochs.
...and 2 more figures
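The twin-regression step described in the Figure 1 caption, and the low/high index comparison in Figure 3, can be sketched as below. This is an illustration under stated assumptions, not the paper's code: the function name `fit_twin_mmr`, the synthetic data, the inclusion of a main effect for the index, and the normal approximation used for the p-values are all choices made here. The `index` array stands in for the output of an already-trained network.

```python
import math
import numpy as np

def fit_twin_mmr(y, x, index, controls):
    """OLS of y on [1, x, index, x*index, controls]; returns coefficients, SEs, p-values."""
    X = np.column_stack([np.ones_like(x), x, index, x * index, controls])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof                  # residual variance estimate
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    z = beta / se
    # Assumption: two-sided p-values from a normal approximation (reasonable for large n).
    p = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])
    return beta, se, p

# Synthetic data with a known interaction of 0.4, for illustration only.
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)                 # focal predictor
index = rng.normal(size=n)             # stand-in for the network's summary index
controls = rng.normal(size=(n, 2))
y = (1.0 - 0.5 * x + 0.2 * index + 0.4 * x * index
     + controls @ np.array([0.3, -0.1]) + rng.normal(size=n))

beta, se, p = fit_twin_mmr(y, x, index, controls)
interaction_coef, interaction_p = beta[3], p[3]

# Simple slopes: effect of x for low (10th percentile) and high (90th percentile) index.
slope_low = beta[1] + beta[3] * np.quantile(index, 0.10)
slope_high = beta[1] + beta[3] * np.quantile(index, 0.90)
```

Comparing `slope_low` and `slope_high` mirrors the grouping used in Figure 3: if the interaction coefficient is significant, the effect of the focal predictor differs between individuals with low and high index values.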
