From Detection to Mitigation: Addressing Bias in Deep Learning Models for Chest X-Ray Diagnosis
Clemence Mottez, Louisa Fay, Maya Varma, Sophie Ostmeier, Curtis Langlotz
TL;DR
This work targets bias in chest X-ray diagnosis by building a lightweight bias-detection and mitigation framework that replaces the CNN classifier head with an XGBoost adapter, enabling multi-label disease prediction with reduced computational cost. It demonstrates model-agnostic applicability across DenseNet-121 and ResNet-50, and shows that XGBoost-based retraining, especially when combined with active learning, reduces demographic disparities (sex, age, race) while preserving overall accuracy on CheXpert and MIMIC. The paper compares against full-model retraining and traditional bias-mitigation methods, finding competitive or superior fairness gains at a fraction of the cost, and validates clinical relevance through reduced disparities in false-negative rates and equalized odds. These results offer a practical path toward equitable deployment of deep learning in radiology, with implications for broader imaging tasks and architectures.
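To make the fairness quantities named above concrete, here is a minimal sketch of how a false-negative-rate gap and an equalized-odds gap could be computed for one disease label across demographic subgroups. The function names, the toy data, and the choice of "max minus min across groups" as the gap are illustrative assumptions, not the paper's exact definitions.

```python
from collections import defaultdict


def rates_by_group(y_true, y_pred, groups):
    """Return {group: (fnr, fpr)} for binary labels and binary predictions."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        if t:
            counts[g]["tp" if p else "fn"] += 1
        else:
            counts[g]["fp" if p else "tn"] += 1
    out = {}
    for g, c in counts.items():
        pos = c["tp"] + c["fn"]
        neg = c["fp"] + c["tn"]
        fnr = c["fn"] / pos if pos else float("nan")
        fpr = c["fp"] / neg if neg else float("nan")
        out[g] = (fnr, fpr)
    return out


def fnr_gap(rates):
    """Largest difference in false-negative rate between any two groups."""
    fnrs = [fnr for fnr, _ in rates.values()]
    return max(fnrs) - min(fnrs)


def equalized_odds_gap(rates):
    """Assumed definition: the larger of the TPR gap and the FPR gap.

    The TPR gap equals the FNR gap because TPR = 1 - FNR.
    """
    fprs = [fpr for _, fpr in rates.values()]
    return max(fnr_gap(rates), max(fprs) - min(fprs))


# Toy data (hypothetical): 8 patients per group, 4 positives and 4 negatives each.
y_true = [1, 1, 1, 1, 0, 0, 0, 0] * 2
y_pred = [1, 1, 0, 0, 0, 0, 0, 1] + [1, 1, 1, 0, 0, 0, 0, 0]
groups = ["F"] * 8 + ["M"] * 8

rates = rates_by_group(y_true, y_pred, groups)
print(rates)
print("FNR gap:", fnr_gap(rates))
print("Equalized-odds gap:", equalized_odds_gap(rates))
```

In this toy case the model misses half of the positive cases in group F but only a quarter in group M, so the FNR gap is 0.25. A mitigation method would aim to shrink that gap without lowering overall accuracy.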
Abstract
Deep learning models have shown promise in improving diagnostic accuracy from chest X-rays, but they also risk perpetuating healthcare disparities when performance varies across demographic groups. In this work, we present a comprehensive bias detection and mitigation framework targeting sex-, age-, and race-based disparities in chest X-ray diagnostic tasks. We extend a recent CNN-XGBoost pipeline to support multi-label classification and evaluate its performance across four medical conditions. We show that replacing the final layer of the CNN with an eXtreme Gradient Boosting (XGBoost) classifier improves subgroup fairness while maintaining or improving overall predictive performance. To validate its generalizability, we apply the method to different backbones, namely DenseNet-121 and ResNet-50, and achieve similarly strong performance and fairness outcomes, confirming its model-agnostic design. We further compare this lightweight adapter training method with traditional full-model bias mitigation techniques, including adversarial training, reweighting, data augmentation, and active learning, and find that our approach offers competitive or superior bias reduction at a fraction of the computational cost. Finally, we show that combining XGBoost retraining with active learning yields the largest reduction in bias across all demographic subgroups, both in-distribution and out-of-distribution on the CheXpert and MIMIC datasets, establishing a practical and effective path toward equitable deep learning deployment in clinical radiology.
