Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods

Vincent Olesen, Nina Weng, Aasa Feragen, Eike Petersen

TL;DR

This study demonstrates the effectiveness of slice discovery methods (SDMs) in hypothesis formulation and yields an explanation of previously observed but unexplained performance disparities between male and female patients in widely used chest X-ray datasets and models.

Abstract

Machine learning models have achieved high overall accuracy in medical image analysis. However, performance disparities on specific patient groups pose challenges to their clinical utility, safety, and fairness. Such disparities can affect known patient groups (such as those defined by sex, age, or disease subtype) as well as previously unknown and unlabeled groups. Furthermore, the root cause of an observed performance disparity is often challenging to uncover, which hinders mitigation efforts. To address these issues, we leverage Slice Discovery Methods (SDMs) to identify interpretable underperforming subsets of data and to formulate hypotheses regarding the cause of observed performance disparities. We introduce a novel SDM and apply it in a case study on the classification of pneumothorax and atelectasis from chest X-rays. Our study demonstrates the effectiveness of SDMs in hypothesis formulation and yields an explanation of previously observed but unexplained performance disparities between male and female patients in widely used chest X-ray datasets and models. Our findings indicate shortcut learning in both classification tasks, through the presence of chest drains and ECG wires, respectively. Sex-based differences in the prevalence of these shortcut features appear to cause the observed classification performance gap, representing a previously underappreciated interaction between shortcut learning and model fairness analyses.

Paper Structure (12 sections, 6 figures, 1 table)

Figures (6)

  • Figure 1: A general overview of the key elements of slice discovery methods.
  • Figure 2: The comorbidity and chest drain distribution in pneumothorax-positive chest drain annotated samples of NIH-CXR14 for the worst-performing (left column) and best-performing (right column) slices by Brier score. The pneumothorax-negative case is omitted as chest drain annotations were not available for these samples in NIH-CXR14.
  • Figure 3: AUROC for pneumothorax prediction on CheXpert for male and female test subjects, under the natural ('unbalanced') distribution of chest drains and when balanced by chest drain presence, across ten samplings of the train-validation-test sets.
  • Figure 4: The comorbidity and chest drain distribution in (a) pneumothorax-negative (top row) and (b) pneumothorax-positive chest drain annotated samples of CheXpert for the worst-performing (left column) and best-performing (right column) slices by upper 95% bootstrapped Brier scores.
  • Figure 5: Distribution of confidences (the softmax output of the model for the disease-positive class) for pneumothorax classification by sex, presence of pneumothorax, and chest drain for NIH-CXR14 (left) and CheXpert (right). Throughout, subjects without chest drains are more likely to be classified as pneumothorax-negative.
  • ...and 1 more figure
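
The captions for Figures 2 and 4 rank slices by Brier score and by an upper 95% bootstrapped Brier score. The following is a minimal sketch of how such a ranking could be computed, not the authors' implementation. It assumes binary labels, predicted probabilities for the positive class, and a precomputed slice assignment per sample. The function names (`brier_score`, `bootstrap_upper_brier`, `rank_slices`) are hypothetical, and reading "upper 95%" as the 95th percentile of the bootstrap distribution is an assumption, since the captions do not define it.

```python
import numpy as np


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and binary label."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))


def bootstrap_upper_brier(y_true, y_prob, n_boot=1000, q=95.0, seed=0):
    """Upper percentile (assumed: 95th) of the Brier score over bootstrap resamples."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    # Each row holds the sample indices of one resample drawn with replacement.
    idx = rng.integers(0, n, size=(n_boot, n))
    scores = np.mean((y_prob[idx] - y_true[idx]) ** 2, axis=1)
    return float(np.percentile(scores, q))


def rank_slices(y_true, y_prob, slice_ids, **kwargs):
    """Return (slice_id, upper_brier) pairs sorted from worst to best slice."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    slice_ids = np.asarray(slice_ids)
    results = []
    for s in np.unique(slice_ids):
        mask = slice_ids == s
        upper = bootstrap_upper_brier(y_true[mask], y_prob[mask], **kwargs)
        results.append((s.item(), upper))
    # A higher Brier score means worse calibration and accuracy, so sort descending.
    return sorted(results, key=lambda r: r[1], reverse=True)


if __name__ == "__main__":
    # Toy data: slice 0 is predicted well, slice 1 is predicted poorly.
    labels = [1, 0, 1, 0, 1, 1, 0, 0]
    probs = [0.9, 0.1, 0.8, 0.2, 0.1, 0.2, 0.9, 0.8]
    slices = [0, 0, 0, 0, 1, 1, 1, 1]
    for slice_id, upper in rank_slices(labels, probs, slices, n_boot=200):
        print(f"slice {slice_id}: upper bootstrapped Brier = {upper:.3f}")
```

Ranking by an upper bootstrap bound rather than the point estimate changes how small slices are treated, because their Brier scores vary more across resamples. Which percentile and how many resamples to use are choices this sketch leaves open.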