Disparate Model Performance and Stability in Machine Learning Clinical Support for Diabetes and Heart Diseases
Ioannis Bilionis, Ricardo C. Berrios, Luis Fernandez-Luque, Carlos Castillo
TL;DR
To address health equity challenges in ML support for chronic disease care, the paper introduces a three-pronged framework that evaluates fairness beyond representativeness. The authors analyze AUROC across sex and age groups with cross-validated ensembles, quantify data complexity, and measure systematic arbitrariness via self-consistency and KS tests on seven chronic-disease datasets. They find mild male/female differences and more pronounced age-related disparities, especially for older patients. For older patients, predictions are less consistent, and this inconsistency correlates with higher data complexity and higher arbitrariness; in some cases, substantial data augmentation would be required to reach parity. The results indicate that representativeness alone does not guarantee equity, and they call for comprehensive fairness audits, checking data complexity against performance, and monitoring model stability before clinical deployment. The study provides a practical methodology and empirical evidence supporting cautious, ethics-informed deployment of ML clinical decision support.
Abstract
Machine Learning (ML) algorithms are vital for supporting clinical decision-making in biomedical informatics. However, their predictive performance can vary across demographic groups, often due to the underrepresentation of historically marginalized populations in training datasets. This investigation reveals widespread sex- and age-related inequities in chronic disease datasets and in the ML models derived from them. To characterize these inequities, a novel analytical framework is introduced that combines systematic arbitrariness with traditional measures such as accuracy and data complexity. The analysis of data from over 25,000 individuals with chronic diseases revealed mild sex-related disparities, with predictive accuracy favoring males, and significant age-related differences, with better accuracy for younger patients. Notably, older patients showed inconsistent predictive accuracy across seven datasets, linked to higher data complexity and lower model performance. This highlights that representativeness in training data alone does not guarantee equitable outcomes, and that model arbitrariness must be addressed before deploying models in clinical settings.
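To make the three kinds of measurement named in the summary above more concrete, the following is a minimal sketch of per-group AUROC, a per-patient self-consistency score, and a two-sample KS statistic comparing the self-consistency distributions of two groups. It is not the authors' implementation. The function names, the toy data, and the choice of pairwise agreement among resampled models as the self-consistency definition are all assumptions made for illustration.

```python
from bisect import bisect_right


def auroc(labels, scores):
    """AUROC as the probability that a random positive case outscores
    a random negative case (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auroc_by_group(labels, scores, groups):
    """Compute AUROC separately for each demographic group."""
    result = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        result[g] = auroc([labels[i] for i in idx], [scores[i] for i in idx])
    return result


def self_consistency(votes):
    """Fraction of model pairs that agree on one patient's binary prediction.

    `votes` holds the 0/1 predictions from B >= 2 models trained on different
    resamples. ASSUMPTION: this pairwise-agreement definition is illustrative
    and may differ from the paper's exact formulation.
    """
    b = len(votes)
    k = sum(votes)  # number of models predicting the positive class
    return 1.0 - 2.0 * k * (b - k) / (b * (b - 1))


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


# Hypothetical toy data: 8 patients split into two age groups.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.3, 0.6, 0.7, 0.4, 0.5]
groups = ["young"] * 4 + ["old"] * 4
group_auroc = auroc_by_group(labels, scores, groups)

# Hypothetical votes from 4 resampled models for each patient.
votes_young = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
votes_old = [[1, 1, 0, 0], [0, 1, 0, 1], [1, 1, 1, 1], [0, 0, 0, 0]]
sc_young = [self_consistency(v) for v in votes_young]
sc_old = [self_consistency(v) for v in votes_old]
arbitrariness_gap = ks_statistic(sc_young, sc_old)

print(group_auroc, arbitrariness_gap)
```

In this toy setup, a large gap between the groups' AUROC values points to a performance disparity, while a large KS statistic between the self-consistency distributions suggests that predictions for one group depend more heavily on which resample the model happened to be trained on.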
