Can we trust deep learning models diagnosis? The impact of domain shift in chest radiograph classification

Eduardo H. P. Pooch, Pedro L. Ballester, Rodrigo C. Barros

TL;DR

The paper investigates how domain shift affects deep learning-based chest radiograph diagnosis by training a CheXNet-style DenseNet-121 on each of four large public datasets and evaluating cross-domain performance. It formalizes domain shift as the divergence between the source and target distributions $p(X_s)$ and $p(X_t)$, and measures performance with AUC on eight common radiographic findings after label harmonization. The results show substantial drops in cross-dataset generalization, with the best results obtained when training and testing on the same dataset. Models trained on CheXpert and MIMIC-CXR generalize better across datasets than those trained on ChestX-ray14 or PadChest, likely due to differences in labelers and dataset scale. The study emphasizes external validation and center-specific fine-tuning to address domain shift, and recommends CheXpert and MIMIC-CXR as more robust sources for developing chest radiograph classifiers.

Abstract

As deep learning models become more widespread, their ability to handle unseen data and generalize to any scenario has yet to be challenged. In medical imaging, image distributions are highly heterogeneous, depending on the equipment that generates the images and its parametrization. This heterogeneity triggers a common issue in machine learning called domain shift: the difference between the training data distribution and the distribution of the data on which a model is deployed. A high domain shift tends to result in poor generalization performance. In this work, we evaluate the extent of domain shift on four of the largest datasets of chest radiographs. We show how training and testing on different datasets (e.g., training on ChestX-ray14 and testing on CheXpert) drastically affects model performance, raising serious questions about the reliability of deep learning models trained on public datasets. We also show that models trained on CheXpert and MIMIC-CXR generalize better to other datasets.
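
The cross-dataset evaluation described above can be sketched as a train-by-test AUC matrix. This is a minimal illustration and not the authors' code: the function names `roc_auc` and `cross_domain_auc`, the dataset names "A" and "B", and all scores are hypothetical, and it handles a single finding, whereas the paper reports AUC over eight findings.

```python
from typing import Dict, List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied scores so they share an average rank.
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def cross_domain_auc(
    scores: Dict[str, Dict[str, List[float]]],
    labels: Dict[str, List[int]],
) -> Dict[str, Dict[str, float]]:
    """Build matrix[train][test] = AUC of the model trained on `train`,
    scored on the test split of `test`."""
    return {
        train: {test: roc_auc(labels[test], preds) for test, preds in per_test.items()}
        for train, per_test in scores.items()
    }


# Hypothetical toy data: one model trained on "A", scored on test sets "A" and "B".
toy_labels = {"A": [0, 0, 1, 1], "B": [0, 1, 0, 1]}
toy_scores = {"A": {"A": [0.1, 0.2, 0.8, 0.9], "B": [0.6, 0.5, 0.4, 0.7]}}
auc_matrix = cross_domain_auc(toy_scores, toy_labels)
```

In this toy example the in-domain AUC (`auc_matrix["A"]["A"]`) is 1.0 and the cross-domain AUC (`auc_matrix["A"]["B"]`) is 0.75, which mimics the kind of drop the paper attributes to domain shift. The numbers are invented for illustration and are not results from the paper.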

Paper Structure

This paper contains 7 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Example of a chest radiograph positive for consolidation randomly sampled from each of the four analyzed datasets: ChestX-ray14, CheXpert, MIMIC-CXR, and PadChest.
  • Figure 2: Dataset pixel intensity probability density function.
  • Figure 3: Average image of each of the four datasets. The last image contains one quarter of each average image to better visualize pixel intensity differences (I - ChestX-ray14, II - CheXpert, III - MIMIC-CXR, IV - PadChest).
  • Figure 4: Performance of a model trained on ChestX-ray14 (a), CheXpert (b), MIMIC-CXR (c), and PadChest (d) on each of the four test sets.
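
The dataset-level statistics behind Figures 2 and 3 (a pixel intensity probability density function and an average image per dataset) could be computed along the following lines. This is a rough sketch under stated assumptions, not the authors' procedure: it assumes 8-bit grayscale images of identical size stored as nested lists, and the names `intensity_pdf`, `average_image`, and `toy_dataset` are hypothetical.

```python
from typing import List

Image = List[List[int]]  # assumed: 8-bit grayscale, rows of pixel values in 0..255


def intensity_pdf(images: List[Image], n_levels: int = 256) -> List[float]:
    """Fraction of all pixels in the dataset that take each intensity level."""
    counts = [0] * n_levels
    total = 0
    for image in images:
        for row in image:
            for pixel in row:
                counts[pixel] += 1
                total += 1
    if total == 0:
        raise ValueError("no pixels to count")
    return [c / total for c in counts]


def average_image(images: List[Image]) -> List[List[float]]:
    """Pixel-wise mean over a dataset of equally sized images."""
    n = len(images)
    height, width = len(images[0]), len(images[0][0])
    return [
        [sum(img[r][c] for img in images) / n for c in range(width)]
        for r in range(height)
    ]


# Hypothetical toy dataset of two 2x2 images.
toy_dataset = [
    [[0, 255], [0, 255]],
    [[0, 0], [255, 255]],
]
pdf = intensity_pdf(toy_dataset)
mean_img = average_image(toy_dataset)
```

Comparing the `pdf` curves or `mean_img` arrays of two datasets gives a quick visual check for differences in their pixel intensity distributions before any model is trained.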