Limitations of Public Chest Radiography Datasets for Artificial Intelligence: Label Quality, Domain Shift, Bias and Evaluation Challenges

Amy Rafferty, Ajitha Rajan

Abstract

Artificial intelligence has shown significant promise in chest radiography, where deep learning models can approach radiologist-level diagnostic performance. Progress has been accelerated by large public datasets such as MIMIC-CXR, ChestX-ray14, PadChest, and CheXpert, which provide hundreds of thousands of labelled images with pathology annotations. However, these datasets also present important limitations. Automated label extraction from radiology reports introduces errors, particularly in handling uncertainty and negation, and radiologist review frequently disagrees with assigned labels. In addition, domain shift and population bias restrict model generalisability, while evaluation practices often overlook clinically meaningful measures. We conduct a systematic analysis of these challenges, focusing on label quality, dataset bias, and domain shift. Our cross-dataset domain shift evaluation across multiple model architectures revealed substantial external performance degradation, with pronounced reductions in AUPRC and F1 scores relative to internal testing. To assess dataset bias, we trained a source-classification model that distinguished datasets with near-perfect accuracy, and performed subgroup analyses showing reduced performance for minority age and sex groups. Finally, expert review by two board-certified radiologists identified significant disagreement with public dataset labels. Our findings highlight important clinical weaknesses of current benchmarks and emphasise the need for clinician-validated datasets and fairer evaluation frameworks.

Paper Structure

This paper contains 29 sections, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Example radiography report from the MIMIC-CXR dataset. The associated chest radiograph is labelled as Lung Cancer and Pneumonia by the automated CheXpert labeller. Pneumonia indicators (red) are negative mentions, leading to a false diagnosis.
  • Figure 2: Macro-averaged AUPRC degradation heatmaps for each model architecture (internal $-$ external). Rows = training dataset; columns = test dataset. Shared color scale across panels.
  • Figure 3: Example by-class F1 threshold selection, for the EfficientNetV2-S model trained on the MIMIC-CXR dataset. This is repeated for all architectures, on each training dataset. These internal thresholds are used to assess model performance on external test sets.
  • Figure 4: Macro-averaged F1 score degradation heatmaps (internal $-$ external) for all model architectures. Rows = training dataset; columns = test dataset. Shared color scale across panels.
  • Figure 5: Example chest radiographs from MIMIC-CXR, CheXpert, ChestX-ray14 and PadChest. Each image contains text artefacts.
  • ...and 1 more figure
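
The captions for Figures 3 and 4 describe choosing a per-class F1 threshold on internal data and reusing it on external test sets, then reporting degradation as internal $-$ external. A minimal sketch of that procedure follows, under stated assumptions: the helper names (`best_f1_threshold`, `macro_f1`, `f1_degradation`), the candidate-threshold grid (the observed scores), the `score >= threshold` decision rule, and the toy data are all illustrative and are not taken from the paper, which may select thresholds on a different split or grid.

```python
from typing import List, Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1; returns 0.0 when there are no positives or predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Pick the threshold (among observed scores) that maximises F1.

    Assumption: a sample is predicted positive when score >= threshold.
    Ties keep the lowest threshold.
    """
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(scores)):
        f1 = f1_score(y_true, [int(s >= t) for s in scores])
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t


def macro_f1(labels: List[Sequence[int]], scores: List[Sequence[float]],
             thresholds: Sequence[float]) -> float:
    """Unweighted mean of per-class F1, using one fixed threshold per class."""
    per_class = [
        f1_score(y, [int(s >= t) for s in sc])
        for y, sc, t in zip(labels, scores, thresholds)
    ]
    return sum(per_class) / len(per_class)


def f1_degradation(int_labels, int_scores, ext_labels, ext_scores
                   ) -> Tuple[List[float], float]:
    """Select thresholds internally, then return (thresholds, internal - external)."""
    thresholds = [best_f1_threshold(y, s) for y, s in zip(int_labels, int_scores)]
    internal = macro_f1(int_labels, int_scores, thresholds)
    external = macro_f1(ext_labels, ext_scores, thresholds)
    return thresholds, internal - external


# Toy data (hypothetical): two classes, four studies per dataset.
int_labels = [[0, 0, 1, 1], [1, 0, 1, 0]]
int_scores = [[0.1, 0.4, 0.35, 0.8], [0.9, 0.2, 0.7, 0.3]]
ext_labels = [[0, 1, 1, 0], [1, 1, 0, 0]]
ext_scores = [[0.5, 0.3, 0.9, 0.2], [0.6, 0.8, 0.75, 0.1]]

thresholds, degradation = f1_degradation(int_labels, int_scores,
                                         ext_labels, ext_scores)
print(thresholds, round(degradation, 3))
```

Freezing the thresholds before touching the external data is the point of the sketch: re-tuning them on each external set would hide part of the performance drop that the degradation heatmaps are meant to expose.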