To which reference class do you belong? Measuring racial fairness of reference classes with normative modeling

Saige Rutherford; Thomas Wolfers; Charlotte Fraza; Nathaniel G. Harnett; Christian F. Beckmann; Henricus G. Ruhe; Andre F. Marquand

TL;DR

This paper investigates how the racial composition of reference classes used in normative modeling affects the interpretation of deviations in brain structure. By comparing pre-trained, race-not-included, and race-included normative models on two large neuroimaging cohorts (HCP and UKB), the authors quantify racial biases in deviation scores and residuals and demonstrate that race can be predicted from model features with high accuracy. They reveal persistent racial disparities even when race is included as a predictor, highlighting that deviations may reflect demographic mismatch with the reference class rather than true pathology. The work emphasizes the urgency of collecting more representative, granular data and promotes transparent reporting to responsibly translate normative-model deviations into clinical meaning and health equity gains.

Abstract

Reference classes in healthcare establish healthy norms, such as pediatric growth charts of height and weight, and are used to chart deviations from these norms which represent potential clinical risk. How the demographics of the reference class influence clinical interpretation of deviations is unknown. Using normative modeling, a method for building reference classes, we evaluate the fairness (racial bias) in reference models of structural brain images that are widely used in psychiatry and neurology. We test whether including race in the model creates fairer models. We predict self-reported race using the deviation scores from three different reference class normative models, to better understand bias in an integrated, multivariate sense. Across all of these tasks, we uncover racial disparities that are not easily addressed with existing data or commonly used modeling techniques. Our work suggests that deviations from the norm could be due to demographic mismatch with the reference class, and assigning clinical meaning to these deviations should be done with caution. Our approach also suggests that acquiring more representative samples is an urgent research priority.
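To make the notion of a deviation from a reference class concrete, here is a minimal sketch of how deviation scores might be computed and summarized per group. It assumes the normative model has already produced a predicted mean and standard deviation for each individual. The function names, the toy numbers, the group labels, and the |z| > 2 threshold for an "extreme" deviation are illustrative assumptions, not values taken from the paper.

```python
from statistics import mean


def deviation_scores(observed, predicted_mean, predicted_sd):
    """Z-score each observation against the reference-class prediction."""
    return [(y - m) / s for y, m, s in zip(observed, predicted_mean, predicted_sd)]


def summarize_by_group(z, groups, threshold=2.0):
    """Mean deviation and percentage of extreme deviations per group.

    The threshold is an assumed cut-off for this sketch.
    """
    summary = {}
    for label in sorted(set(groups)):
        zs = [v for v, g in zip(z, groups) if g == label]
        summary[label] = {
            "mean_z": mean(zs),
            "pct_extreme_pos": 100.0 * sum(v > threshold for v in zs) / len(zs),
            "pct_extreme_neg": 100.0 * sum(v < -threshold for v in zs) / len(zs),
        }
    return summary


# Hypothetical toy data for one brain region and four individuals.
observed = [2.0, 3.0, 10.0, 1.0]
predicted_mean = [2.0, 2.0, 4.0, 4.0]
predicted_sd = [1.0, 1.0, 2.0, 1.0]
groups = ["a", "a", "b", "b"]  # placeholder group labels

z = deviation_scores(observed, predicted_mean, predicted_sd)
summary = summarize_by_group(z, groups)
print(z)        # [0.0, 1.0, 3.0, -3.0]
print(summary)
```

In this toy case group "b" has a mean deviation of zero yet half of its members are extreme in each direction. This is one reason to inspect both the average deviation and the percentage of extreme deviations rather than either summary alone.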

Paper Structure

This paper contains 24 sections, 3 equations, 6 figures, and 5 tables.

Introduction
Background and Problem Formulation
Fairness in Machine Learning for Healthcare Setting
Racial Fairness
Normative Model Setting
Cohort
Cohort Selection and Inclusion Criteria
Train/test split - Normative models
Feature Extraction
Methods
Normative Model Estimation
Evaluating Fairness of Normative Models
Qualitative evaluation: summarizing models for each racial group
Quantitative evaluation: testing for group differences
Predicting Race
...and 9 more sections

Figures (6)

Figure 1: Overview of analysis workflow. A) Normative models of brain structure were used to generate deviation scores. Three normative models were fit (pre-trained, race not included, and race included), representing two different reference classes and two sets of covariates. B) Normative models were estimated for all regions in the Destrieux atlas (Destrieux et al., 2010), a commonly used anatomical brain parcellation. C) The effect of self-reported race on the distribution of normative modeling deviation scores was quantified across all three normative models. D) Self-reported race was predicted using normative modeling deviation scores as features.
Figure 2: Summary of normative model deviation scores across all three reference classes (pre-trained, race not included, and race included) in HCP and UKB datasets. A) Average (mean) deviations for all brain regions within all racial groups (columns). B) Percentage of extreme deviations (positive and negative) for all brain regions within all racial groups (columns).
Figure 3: Group differences in A) residual errors and B) deviation scores across all three reference classes (pre-trained, race not modeled, and race modeled) in HCP and UKB. The t-statistic is plotted with White individuals as group one and Asian or Black individuals as group two. Light colors (positive t-statistic) represent larger residual errors or deviations in White individuals, and dark colors (negative t-statistic) represent larger residual errors or deviations in Asian or Black individuals. Brain regions with statistically significant group differences after multiple comparison correction (FDR-corrected $p<0.05$) are shown. The number of brain regions showing group differences for each model is shown in Table 4.
Figure 4: Prediction of self-reported race in A) HCP and B) UKB datasets using deviation scores from three different reference class normative models (pre-trained, race not included, and race included) as features. Performance is evaluated with confusion matrices, receiver operating characteristic (ROC) curves, and area under the ROC curve (AUC). In the confusion matrices, the diagonal elements show where predicted label == true label, and the off-diagonal elements show mislabeled cases (predicted label != true label). The confusion matrices were normalized by the true labels to show ratios rather than counts. For the ROC curves, we plot the performance of each fold of 5-fold cross-validation (lighter colors, thin lines) and also the mean across all folds (darker colors, thicker lines).
Figure 5: Evaluation metrics for normative models in HCP dataset.
...and 1 more figures
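
The Figure 4 caption states that the confusion matrices were normalized by the true labels. A minimal sketch of that normalization follows, assuming it means dividing each true-label row by its total so that every row sums to one. The function name, the class labels, and the toy predictions are hypothetical and are not taken from the paper.

```python
def normalized_confusion_matrix(true_labels, predicted_labels, classes):
    """Confusion matrix with rows = true class, columns = predicted class.

    Each row is divided by its total, so entries are ratios, not counts.
    """
    index = {c: i for i, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t]][index[p]] += 1

    normalized = []
    for row in counts:
        total = sum(row)
        # Guard against classes that never occur among the true labels.
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Hypothetical toy labels: four individuals of class "A", two of class "B".
true_labels = ["A", "A", "A", "A", "B", "B"]
predicted_labels = ["A", "A", "A", "B", "B", "A"]

matrix = normalized_confusion_matrix(true_labels, predicted_labels, ["A", "B"])
print(matrix)  # [[0.75, 0.25], [0.5, 0.5]]
```

Normalizing by row makes classes of very different sizes comparable: the diagonal entry of each row is the fraction of that class labeled correctly, regardless of how many individuals the class contains.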
