Improving Fairness of Automated Chest X-ray Diagnosis by Contrastive Learning

Mingquan Lin; Tianhao Li; Zhaoyi Sun; Gregory Holste; Ying Ding; Fei Wang; George Shih; Yifan Peng

TL;DR

This study tackles fairness in automated chest X-ray diagnosis by introducing a supervised contrastive learning framework that yields fair image embeddings. By pretraining a DenseNet-121 backbone with carefully defined positive and negative samples, the approach reduces subgroup biases quantified by $\Delta$mAUC across sex, race, and age on two large datasets (MIDRC and NIH-CXR) while maintaining diagnostic performance. External validation on MIMIC-CXR corroborates improved fairness and generalization. Overall, the method offers a practical path to fairer radiology AI that can be integrated into clinical workflows with preserved accuracy and reliability.

Abstract

Purpose: Few studies have explored concrete methods for improving model fairness in radiology. We propose an AI model that uses supervised contrastive learning to minimize bias in chest X-ray (CXR) diagnosis. Materials and Methods: In this retrospective study, we evaluated the proposed method on two datasets: the Medical Imaging and Data Resource Center (MIDRC) dataset, with 77,887 CXR images from 27,796 patients collected as of April 20, 2023 for COVID-19 diagnosis, and the NIH Chest X-ray (NIH-CXR) dataset, with 112,120 CXR images from 30,805 patients collected between 1992 and 2015. In the NIH-CXR dataset, thoracic abnormalities include atelectasis, cardiomegaly, effusion, infiltration, mass, nodule, pneumonia, pneumothorax, consolidation, edema, emphysema, fibrosis, pleural thickening, and hernia. The method uses supervised contrastive learning with carefully selected positive and negative samples to generate fair image embeddings, which are then fine-tuned for downstream tasks to reduce bias in CXR diagnosis. We evaluated the methods using the marginal AUC difference ($\Delta$mAUC). Results: The proposed model showed a significant decrease in bias across all subgroups compared with the baseline models, as evidenced by a paired t-test (p<0.0001). The $\Delta$mAUC values obtained by our method were 0.0116 (95% CI, 0.0110-0.0123), 0.2102 (95% CI, 0.2087-0.2118), and 0.1000 (95% CI, 0.0988-0.1011) for sex, race, and age on MIDRC, and 0.0090 (95% CI, 0.0082-0.0097) for sex and 0.0512 (95% CI, 0.0512-0.0532) for age on NIH-CXR, respectively. Conclusion: Employing supervised contrastive learning can mitigate bias in CXR diagnosis, addressing concerns of fairness and reliability in deep learning-based diagnostic methods.
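
The abstract reports bias as a difference in AUC between subgroups. The paper's exact marginal AUC definition belongs to its Bias Definition section, which is not reproduced here, so the following is only a minimal sketch under a stated assumption: AUC is computed separately within each subgroup, and the gap is the largest subgroup AUC minus the smallest. The function names `auc` and `subgroup_auc_gap` are hypothetical and not taken from the paper.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def subgroup_auc_gap(
    labels: Sequence[int],
    scores: Sequence[float],
    groups: Sequence[Hashable],
) -> Tuple[float, Dict[Hashable, float]]:
    """Return (max AUC - min AUC across subgroups, per-subgroup AUCs).

    ASSUMPTION: this per-subgroup max-minus-min gap is an illustrative
    stand-in, not necessarily the paper's marginal AUC difference.
    """
    by_group: Dict[Hashable, Tuple[List[int], List[float]]] = {}
    for y, s, g in zip(labels, scores, groups):
        ys, ss = by_group.setdefault(g, ([], []))
        ys.append(y)
        ss.append(s)
    aucs = {g: auc(ys, ss) for g, (ys, ss) in by_group.items()}
    return max(aucs.values()) - min(aucs.values()), aucs


# Toy example with two subgroups of four images each.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.2, 0.6, 0.7, 0.8, 0.2]
groups = ["F", "F", "F", "F", "M", "M", "M", "M"]
gap, per_group = subgroup_auc_gap(labels, scores, groups)
```

A gap near zero means the classifier ranks positives above negatives about equally well in every subgroup. Repeating the computation over bootstrap resamples would give a confidence interval of the kind reported above.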

Paper Structure (16 sections, 2 equations, 5 figures, 12 tables)

This paper contains 16 sections, 2 equations, 5 figures, 12 tables.

Introduction
Materials and Methods
Dataset Acquisition
Bias Definition
Overall Architecture
Contrastive Learning Model
Downstream Prediction
Experimental Settings
Statistical Analysis
Results
Study Participants
Data investigation
Model Fairness Comparisons in MIDRC Dataset
Model Fairness Comparisons in NIH-CXR Dataset
External Validation
...and 1 more sections

Figures (5)

Figure 1: Creation of MIDRC dataset
Figure 2: Overview of the proposed workflow using the Contrastive Learning Model for Fairness. For example, when the image of a male with COVID-19 serves as the anchor, the image of a female with COVID-19 also serves as a positive sample, whereas the image of a male without COVID-19 is treated as a negative sample.
Figure 3: Forest plot of relative odds (95% confidence intervals) of COVID-19 (MIDRC) and thorax abnormality (NIH-CXR) associated with age, sex, and race.
Figure 4: $\Delta$mAUC across subgroups of sex (a), age (b), and race (c) in COVID-19 detection on the MIDRC dataset. The results are averaged over 200 bootstrap experiments. ****: p-value $\leq$ 0.0001. Balance DenseNet-121: DenseNet-121 with balanced empirical risk minimization [Vapnik1991-ro]. ADV: adversarial [Wadsworth2018-ob]. SCL: supervised contrastive learning [Khosla2020-kq].
Figure 5: $\Delta$mAUC across subgroups of sex (a) and age (b) in thorax abnormality detection on the NIH-CXR dataset. The results are averaged over 200 bootstrap experiments. ****: p-value $\leq$ 0.0001. Balance DenseNet-121: DenseNet-121 with balanced empirical risk minimization [Vapnik1991-ro]. ADV: adversarial [Wadsworth2018-ob]. SCL: supervised contrastive learning [Khosla2020-kq].
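
The Figure 2 caption gives one concrete example of how positive and negative samples are chosen relative to an anchor. The sketch below generalizes that single example into a rule, which is an assumption rather than the paper's stated procedure: a positive shares the anchor's diagnosis but comes from a different demographic subgroup, and a negative shares the anchor's subgroup but has a different diagnosis. The `Sample` class and `select_pairs` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Sample:
    image_id: str
    label: int   # 1 = COVID-19 positive, 0 = COVID-19 negative
    group: str   # sensitive attribute value, e.g. "M" or "F"


def select_pairs(anchor: Sample, pool: List[Sample]) -> Tuple[List[Sample], List[Sample]]:
    """Split a candidate pool into positives and negatives for one anchor.

    ASSUMPTION (inferred from the Figure 2 example only):
      positive = same label as the anchor, different subgroup
      negative = different label from the anchor, same subgroup
    Candidates that match neither rule are left out of both lists.
    """
    positives = [
        s for s in pool
        if s.image_id != anchor.image_id
        and s.label == anchor.label
        and s.group != anchor.group
    ]
    negatives = [
        s for s in pool
        if s.label != anchor.label and s.group == anchor.group
    ]
    return positives, negatives


# The caption's example: a male with COVID-19 as the anchor.
anchor = Sample("m_pos", label=1, group="M")
pool = [
    anchor,
    Sample("f_pos", label=1, group="F"),  # female with COVID-19
    Sample("m_neg", label=0, group="M"),  # male without COVID-19
    Sample("f_neg", label=0, group="F"),  # matches neither rule
]
positives, negatives = select_pairs(anchor, pool)
```

Under this reading, pulling the anchor toward positives from another subgroup and pushing it away from same-subgroup negatives encourages embeddings that encode the diagnosis rather than the sensitive attribute.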
