Fairness Analysis of CLIP-Based Foundation Models for X-Ray Image Classification
Xiangyu Sun, Xiaoguang Zou, Yuanquan Wu, Guotai Wang, Shaoting Zhang
TL;DR
This work addresses fairness in CLIP-based foundation models applied to X-ray image classification, a domain where data distributions differ markedly from natural images. It builds a balanced subset of the NIH Chest X-ray dataset (NIH 6x200) and evaluates four CLIP-based approaches (including GLoRIA, MedCLIP variants, BioMedCLIP) under zero-shot inference and four fine-tuning regimes: linear probing (LP), a multilayer perceptron head (MLP), Low-Rank Adaptation (LoRA), and full fine-tuning (FT). Utility is measured by accuracy and class-wise F1 scores. Fairness is quantified by Var_F1 across diseases and by F1_Δ, EqOdds, and ECE_Δ across age and gender. Zero-shot performance is limited. Full fine-tuning boosts accuracy (e.g., MedCLIP_ViT reaches 59.6% overall), but substantial fairness gaps persist: GLoRIA offers the best fairness across disease types, while MedCLIP_ViT shows pronounced demographic biases. These results underscore the need for fairness interventions in medical foundation models.
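The fairness metrics named above can be illustrated with a minimal NumPy sketch. The definitions here are assumptions for illustration, not the paper's exact formulas: Var_F1 is taken as the variance of per-class F1, F1_Δ and ECE_Δ as the largest gap in macro F1 and in expected calibration error between demographic subgroups, and EqOdds as the larger of the true-positive-rate and false-positive-rate gaps for a binary task. All function names are hypothetical.

```python
import numpy as np

def f1_per_class(y_true, y_pred, n_classes):
    """One-vs-rest F1 per class; a class with no labels or predictions scores 0."""
    scores = np.zeros(n_classes)
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom > 0 else 0.0
    return scores

def var_f1(y_true, y_pred, n_classes):
    """Assumed Var_F1: variance of the per-class (per-disease) F1 scores."""
    return float(np.var(f1_per_class(y_true, y_pred, n_classes)))

def f1_delta(y_true, y_pred, groups, n_classes):
    """Assumed F1_delta: max minus min macro F1 over demographic subgroups."""
    macro = [f1_per_class(y_true[groups == g], y_pred[groups == g], n_classes).mean()
             for g in np.unique(groups)]
    return float(max(macro) - min(macro))

def eq_odds_gap(y_true, y_pred, groups):
    """Assumed EqOdds (binary labels): larger of the TPR gap and the FPR gap.
    Assumes every subgroup contains both positive and negative samples."""
    tprs, fprs = [], []
    for g in np.unique(groups):
        m = groups == g
        tprs.append(np.mean(y_pred[m & (y_true == 1)] == 1))
        fprs.append(np.mean(y_pred[m & (y_true == 0)] == 1))
    return float(max(max(tprs) - min(tprs), max(fprs) - min(fprs)))

def ece(conf, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (conf > lo) & (conf <= hi)
        if m.any():
            # Bin weight times |accuracy - mean confidence| within the bin.
            total += m.mean() * abs(correct[m].mean() - conf[m].mean())
    return float(total)

def ece_delta(conf, correct, groups, n_bins=10):
    """Assumed ECE_delta: max minus min ECE over demographic subgroups."""
    vals = [ece(conf[groups == g], correct[groups == g], n_bins)
            for g in np.unique(groups)]
    return float(max(vals) - min(vals))
```

Under these assumed definitions, lower values indicate more even performance: a Var_F1 of 0 means every disease class has the same F1, and an F1_Δ of 0 means every subgroup has the same macro F1.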
Abstract
X-ray imaging is pivotal in medical diagnostics, offering non-invasive insights into a range of health conditions. Recently, vision-language models, such as the Contrastive Language-Image Pretraining (CLIP) model, have demonstrated potential in improving diagnostic accuracy by leveraging large-scale image-text datasets. However, since CLIP was not initially designed for medical images, several CLIP-like models trained specifically on medical images have been developed. Despite their enhanced performance, issues of fairness, particularly regarding demographic attributes, remain largely unaddressed. In this study, we perform a comprehensive fairness analysis of CLIP-like models applied to X-ray image classification. We assess their performance and fairness across diverse patient demographics and disease categories using zero-shot inference and various fine-tuning techniques, including Linear Probing, Multilayer Perceptron (MLP), Low-Rank Adaptation (LoRA), and full fine-tuning. Our results indicate that while fine-tuning improves model accuracy, fairness concerns persist, highlighting the need for further fairness interventions in these foundation models.
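To make two of the evaluation regimes above concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes image and text embeddings have already been produced by some CLIP-like encoder (the encoders themselves are omitted), and the function names, shapes, and the `alpha`/`r` scaling convention are illustrative assumptions. Zero-shot inference picks the class whose text-prompt embedding is most similar to the image embedding. LoRA keeps a pretrained weight matrix frozen and learns only a low-rank update.

```python
import numpy as np

def zero_shot_predict(image_emb, text_emb):
    """Zero-shot classification by cosine similarity.

    image_emb: (n_images, d) image embeddings.
    text_emb:  (n_classes, d) embeddings of one text prompt per class.
    Returns the index of the most similar class prompt for each image.
    """
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T  # (n_images, n_classes) cosine similarities
    return logits.argmax(axis=1)

def lora_linear(x, W, A, B, alpha, r):
    """Linear layer with a LoRA update (sketch).

    W: (out, in) frozen pretrained weight.
    A: (r, in) and B: (out, r) are the trainable low-rank factors.
    The effective weight is W + (alpha / r) * B @ A.
    """
    W_eff = W + (alpha / r) * (B @ A)
    return x @ W_eff.T

# Usage with placeholder data: B starts at zero, so the adapted layer
# initially reproduces the frozen layer exactly.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W = rng.normal(size=(3, 8))
A = rng.normal(size=(2, 8))
B = np.zeros((3, 2))
out = lora_linear(x, W, A, B, alpha=4.0, r=2)
```

Linear probing, by contrast, would freeze the encoder entirely and train only a single linear classifier on top of `image_emb`, while full fine-tuning would update every encoder weight.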
