Statistical quantification of confounding bias in predictive modelling

Tamas Spisak

TL;DR

The paper tackles confounding bias in predictive modelling by introducing two tests based on conditional permutation testing (CPT): the partial confounder test and the full confounder test. For a given confounder, the partial test probes the null hypothesis that the model is unconfounded, i.e. that its predictions do not depend on the confounder given the outcome, while the full test probes the null hypothesis that the model is fully confounded. Both tests model the required conditional distributions with a generalized additive model (GAM) or multinomial logistic regression, need no re-fitting of the predictive model, and keep Type I error under control for non-normally and non-linearly dependent predictions. They use an $R^2$-based test statistic and a parallel-pairwise Markov-chain Monte-Carlo (MCMC) sampler within the CPT framework to generate valid null distributions. The tests are demonstrated on simulated data and on real neuroimaging datasets (HCP and ABIDE), where they identify and quantify confounding by variables such as age, acquisition batch, center, and motion, and are used to benchmark mitigation approaches. The mlconfound package makes the tests available in practice as a tool for improving the generalizability and neurobiological validity of predictive biomarkers derived from functional connectivity data.

Abstract

The lack of non-parametric statistical tests for confounding bias significantly hampers the development of robust, valid and generalizable predictive models in many fields of research. Here I propose the partial and full confounder tests, which, for a given confounder variable, probe the null hypotheses of unconfounded and fully confounded models, respectively. The tests provide a strict control for Type I errors and high statistical power, even for non-normally and non-linearly dependent predictions, often seen in machine learning. Applying the proposed tests on models trained on functional brain connectivity data from the Human Connectome Project and the Autism Brain Imaging Data Exchange dataset reveals confounders that were previously unreported or found to be hard to correct for with state-of-the-art confound mitigation approaches. The tests, implemented in the package mlconfound (https://mlconfound.readthedocs.io), can aid the assessment and improvement of the generalizability and neurobiological validity of predictive models and, thereby, foster the development of clinically useful machine learning biomarkers.
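
The logic of the partial confounder test summarized above can be illustrated with a small, self-contained sketch. It is not the mlconfound implementation: the conditional distribution of the confounder given the target is modelled here with a plain linear-Gaussian fit instead of a GAM, the function names (`gaussian_loglik_matrix`, `pairwise_step`, `partial_confound_sketch`) are hypothetical, the simulation weights are arbitrary, and the p-value uses a "+1" finite-sample correction, which is an assumption of this sketch.

```python
import numpy as np

def gaussian_loglik_matrix(c, y):
    """Entry [i, j] is log q(c_j | y_i), up to a constant, under an assumed
    linear-Gaussian model c ~ a + b*y (a stand-in for the GAM)."""
    b, a = np.polyfit(y, c, 1)            # slope, intercept
    mu = a + b * y                         # conditional mean for each sample
    sigma = np.std(c - mu)                 # residual standard deviation
    return -0.5 * ((c[None, :] - mu[:, None]) / sigma) ** 2

def pairwise_step(perm, logq, rng):
    """One parallel-pairwise step: draw disjoint random pairs of positions and
    swap each pair with probability odds / (1 + odds). Modifies perm in place."""
    n = len(perm)
    idx = rng.permutation(n)
    half = n // 2
    i, j = idx[:half], idx[half:2 * half]
    log_odds = (logq[i, perm[j]] + logq[j, perm[i]]
                - logq[i, perm[i]] - logq[j, perm[j]])
    log_odds = np.clip(log_odds, -50, 50)  # avoid overflow in exp
    swap = rng.random(half) < 1.0 / (1.0 + np.exp(-log_odds))
    pi, pj = perm[i], perm[j]              # fancy indexing returns copies
    perm[i[swap]] = pj[swap]
    perm[j[swap]] = pi[swap]
    return perm

def r2(a, b):
    """Squared Pearson correlation, used here as the R^2 test statistic."""
    return np.corrcoef(a, b)[0, 1] ** 2

def partial_confound_sketch(y, yhat, c, num_perms=200, steps=50, seed=0):
    """Tests H0: yhat is independent of c given y (unconfounded model)."""
    rng = np.random.default_rng(seed)
    logq = gaussian_loglik_matrix(c, y)

    def run_chain(start):
        perm = start.copy()
        for _ in range(steps):
            pairwise_step(perm, logq, rng)
        return perm

    hub = run_chain(np.arange(len(c)))     # common starting point for all chains
    observed = r2(yhat, c)
    null = np.array([r2(yhat, c[run_chain(hub)]) for _ in range(num_perms)])
    # "+1" correction keeps the p-value strictly positive (assumption of this sketch)
    p = (1 + np.sum(null >= observed)) / (1 + num_perms)
    return p, observed, null

# Toy data with arbitrary weights: c depends on y in both cases.
rng = np.random.default_rng(42)
n = 300
y = rng.normal(size=n)
c = 0.5 * y + rng.normal(size=n)
yhat_h1 = 0.5 * y + 0.5 * c + 0.5 * rng.normal(size=n)  # confounded predictions
yhat_h0 = 0.7 * y + 0.5 * rng.normal(size=n)            # unconfounded predictions

p_h1, obs_h1, null_h1 = partial_confound_sketch(y, yhat_h1, c)
p_h0, obs_h0, null_h0 = partial_confound_sketch(y, yhat_h0, c)
print(f"confounded model:   R2(yhat, c) = {obs_h1:.2f}, p = {p_h1:.3f}")
print(f"unconfounded model: R2(yhat, c) = {obs_h0:.2f}, p = {p_h0:.3f}")
```

Because the permutations preserve the fitted dependence of the confounder on the target, the unconfounded toy model can show a non-zero $R^2$ between predictions and confounder without this counting as evidence of bias. Only an $R^2$ beyond what the target alone explains yields a small p-value.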

Paper Structure

This paper contains 18 sections, 23 equations, 19 figures, 2 tables.

Figures (19)

  • Figure 1: Type I error control of partial Spearman correlation, linear and GAM-based conditional permutation test. Type I error control was investigated in three example cases: normal conditional distribution with linear dependency (first row), slightly non-normal conditional distribution with linear dependency (second row) and normal conditional distribution with non-linear (sigmoid) dependency (third row). The non-normal conditional distribution on the second plot is illustrated with blue density diagrams (kurtosis: -0.8, skewness: -0.1). False positive rates for confounder contributions ($w_{yc}$) and predictive performances ($w_{y\hat{y}}$) are shown in heatmaps. The upper limit for the binomial confidence interval corresponding to $\alpha=0.05$ is 0.065. Values below this threshold (colored white) indicate valid Type I error control.
  • Figure 2: Graphical representation of the proposed partial confounder test. The partial confounder test models the conditional distribution of the confounder, given the target variable, with a generalized additive model (GAM). The parallel-pairwise Markov-chain Monte-Carlo (MCMC) sampler draws permutations of the confounder variable that comply with the GAM-based conditional distribution (permutation 1, 2, ..., m). The test statistic (coefficient of determination, $R^2$) is then computed between the model predictions and the original, as well as the permuted, confounder variables. The p-value is the proportion of permuted test statistics that are more extreme than the original one. Figure source code is available as a Jupyter notebook: https://github.com/pni-lab/mlconfound-manuscript/blob/main/simulated/overview-fig.ipynb
  • Figure 3: Type I error control and power of the partial confounder test based on simulations with normal conditional distribution and linear dependencies. Heatmaps depict positive rates (ratio of p-values lower than 0.05, color coded as shown by the palette on the right) in various simulation settings (100 simulations per tile) with different simulation weights $w_{y\hat{y}}$ (predictive performance; horizontal axis on each heatmap), $w_{yc}$ (confounder-target association; vertical axis on each heatmap), $w_{c\hat{y}}$ (degree of confounder bias; rows) and for different sample sizes (N, columns). Weights 0.2, 0.33, 0.4, 0.6, 0.66, 1.0 can be assigned to the following approximate explained variance values: 4%, 10%, 12%, 25%, 30%, 50%, respectively. The first row contains simulations under the null hypothesis (H0, no confounder bias); rows 2-4 represent simulations under the alternative hypothesis (H1, confounding bias). Positive rates for the simulations under the null and the alternative hypotheses can be interpreted as Type I error rate and statistical power, respectively. The upper 95% confidence limit for a positive rate of $\alpha=0.05$ is 0.11 for each tile.
  • Figure 4: Robustness of conditional permutation based confound testing to non-normality and non-linearity. Simulations included variables with five different degrees of non-normality (top panel), as introduced with various $\delta$ and $\epsilon$ values of the sinh-arcsinh transformation (yellow: normally distributed). Fisher's kurtosis and skewness are given for each distribution. False and true positive rates in the simulations under H0 and H1, respectively, for each investigated sample size (N), are depicted by barplots for both linear and sigmoid dependency structures. The upper 95% binomial confidence limit corresponding to $\alpha=0.05$ is shown with a vertical dashed line.
  • Figure 5: Acquisition batch and age bias of fluid intelligence prediction in the HCP dataset. Scatter plots and regression lines (with 95% confidence intervals) show the association of the observed (horizontal axis) and predicted (vertical axis) fluid intelligence scores with various confound regression strategies. Color-coding of the confounder variables (top: acquisition batch, bottom: age group, as shown by the corresponding legends) reveals confounder bias for both acquisition batch and age in the models trained on the raw data. This bias is robustly detected by the partial confounder test ($p<0.0001$) and seems to be effectively mitigated by both feature regression and COMBAT. The relation between the observed ($Gf$) and predicted ($\hat{y}$) intelligence scores and the confounder variables is given on the graphs via $R^2$ values. Both confound mitigation techniques, but especially COMBAT, improve the predictive performance. A solid red line between the confounder and the prediction indicates significant confounding bias, whereas a blue dashed line denotes that confounder testing provided no evidence for bias. P-values are determined with the partial confounder test. P-values of the 'full' confounder test (not shown) were all less than 0.0001, i.e. the confounders did not fully drive prediction for any of the models.
  • ...and 14 more figures
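
The captions above quote upper confidence limits for an observed positive rate at a nominal $\alpha=0.05$, for example 0.11 with 100 simulations per tile in Figure 3. The captions do not say which binomial interval was used. The sketch below assumes the exact (Clopper-Pearson) two-sided 95% interval and uses only the Python standard library; the function names are hypothetical.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def clopper_pearson_upper(k, n, conf=0.95, tol=1e-10):
    """Upper exact limit: the p at which P(X <= k | n, p) equals (1 - conf) / 2.
    Found by bisection, since the CDF decreases monotonically in p."""
    if k == n:
        return 1.0
    target = (1 - conf) / 2
    lo, hi = k / n, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > target:
            lo = mid    # CDF still too large, so the limit lies above mid
        else:
            hi = mid
    return (lo + hi) / 2

# 100 simulations per tile at a nominal rate of 0.05, i.e. 5 expected positives
upper_100 = clopper_pearson_upper(5, 100)
print(f"upper 95% limit for 5/100 positives: {upper_100:.3f}")  # about 0.11
```

Under this assumption, a tile simulated under H0 is compatible with the nominal 5% error rate as long as its positive rate stays below roughly 0.11.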