Fairness Analysis of CLIP-Based Foundation Models for X-Ray Image Classification

Xiangyu Sun, Xiaoguang Zou, Yuanquan Wu, Guotai Wang, Shaoting Zhang

TL;DR

This work addresses fairness in CLIP-based foundation models applied to X-ray image classification, a domain where data distributions differ markedly from natural images. It builds a balanced NIH Chest X-ray subset (NIH 6x200) and evaluates four CLIP-based approaches (including GLoRIA, MedCLIP variants, BioMedCLIP) under zero-shot inference and multiple fine-tuning regimes (LP, MLP, LoRA, FT). Utility is measured by accuracy and class-wise/F1 scores, while fairness is quantified using Var_F1 across diseases and F1_Δ, EqOdds, and ECE_Δ across age and gender. Zero-shot performance is limited. Full fine-tuning boosts accuracy (e.g., MedCLIP_ViT reaches 59.6% overall), but substantial fairness gaps persist: GLoRIA offers the best disease-type fairness, while MedCLIP_ViT shows pronounced demographic biases. These results underscore the need for fairness interventions in medical foundation models.
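The two F1-based fairness metrics named above can be illustrated with a minimal sketch. The summary does not give their exact definitions, so the code assumes that Var_F1 is the population variance of the per-class F1 scores and that F1_Δ is the largest difference in macro F1 between demographic groups. The function names (`per_class_f1`, `var_f1`, `f1_gap`) are hypothetical and are not taken from the paper.

```python
from statistics import mean, pvariance


def per_class_f1(y_true, y_pred, classes):
    """One-vs-rest F1 for each class label (0.0 when a class never appears)."""
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def var_f1(y_true, y_pred, classes):
    """Assumed Var_F1: population variance of per-class F1 scores."""
    return pvariance(list(per_class_f1(y_true, y_pred, classes).values()))


def f1_gap(y_true, y_pred, groups, classes):
    """Assumed F1_delta: max minus min macro F1 over demographic groups."""
    macro_by_group = []
    for g in sorted(set(groups)):
        idx = [i for i, v in enumerate(groups) if v == g]
        sub_true = [y_true[i] for i in idx]
        sub_pred = [y_pred[i] for i in idx]
        f1s = per_class_f1(sub_true, sub_pred, classes)
        macro_by_group.append(mean(f1s.values()))
    return max(macro_by_group) - min(macro_by_group)


if __name__ == "__main__":
    # Toy labels: two disease classes, two demographic groups.
    y_true = [0, 0, 1, 1]
    y_pred = [0, 0, 1, 0]
    groups = ["F", "F", "M", "M"]
    print(var_f1(y_true, y_pred, [0, 1]))
    print(f1_gap(y_true, y_pred, groups, [0, 1]))
```

Under these assumed definitions, lower values of both quantities indicate more even performance: a Var_F1 of zero means every disease class has the same F1, and an F1_Δ of zero means every demographic group has the same macro F1.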

Abstract

X-ray imaging is pivotal in medical diagnostics, offering non-invasive insights into a range of health conditions. Recently, vision-language models, such as the Contrastive Language-Image Pretraining (CLIP) model, have demonstrated potential in improving diagnostic accuracy by leveraging large-scale image-text datasets. However, since CLIP was not initially designed for medical images, several CLIP-like models trained specifically on medical images have been developed. Despite their enhanced performance, issues of fairness, particularly regarding demographic attributes, remain largely unaddressed. In this study, we perform a comprehensive fairness analysis of CLIP-like models applied to X-ray image classification. We assess their performance and fairness across diverse patient demographics and disease categories using zero-shot inference and various fine-tuning techniques, including Linear Probing, Multilayer Perceptron (MLP), Low-Rank Adaptation (LoRA), and full fine-tuning. Our results indicate that while fine-tuning improves model accuracy, fairness concerns persist, highlighting the need for further fairness interventions in these foundation models.

Paper Structure

This paper contains 13 sections, 4 equations, 3 figures.

Figures (3)

  • Figure 1: Overview of the fairness analysis of different foundation models in X-ray image classification conducted in this study.
  • Figure 2: Performance of various models under different fine-tuning configurations.
  • Figure 3: Fairness analysis after full fine-tuning. (a) Fairness among different disease types in terms of the variance of F1. (b) F1-score of different demographic groups. (c) and (d) Fairness metrics for age and gender.