On Biases in a UK Biobank-based Retinal Image Classification Model

Anissa Alloula, Rima Mustafa, Daniel R McGowan, Bartłomiej W. Papież

TL;DR

This study trains and evaluates a disease classification model on UK Biobank retinal fundus images to test whether its performance differs across population groups, and finds substantial disparities despite strong overall performance.

Abstract

Recent work has uncovered alarming disparities in the performance of machine learning models in healthcare. In this study, we explore whether such disparities are present in the UK Biobank fundus retinal images by training and evaluating a disease classification model on these images. We assess possible disparities across various population groups and find substantial differences despite strong overall performance of the model. In particular, we discover unfair performance for certain assessment centres, which is surprising given the rigorous data standardisation protocol. We compare how these differences emerge and apply a range of existing bias mitigation methods to each one. A key insight is that each disparity has unique properties and responds differently to the mitigation methods. We also find that these methods are largely unable to enhance fairness, highlighting the need for better bias mitigation methods tailored to the specific type of bias.
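
As a rough illustration of the kind of subgroup assessment the abstract describes, the sketch below computes AUC separately per subgroup and reports the gap between the best- and worst-performing groups. It is a minimal sketch on made-up toy data, not the authors' evaluation code: the function names (`auc`, `subgroup_aucs`), the group labels, and the scores are all hypothetical.

```python
from collections import defaultdict


def auc(labels, scores):
    """Rank-based AUC: chance that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def subgroup_aucs(labels, scores, groups):
    """Return {group: AUC} computed only on the examples belonging to each group."""
    by_group = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        by_group[g][0].append(y)
        by_group[g][1].append(s)
    return {g: auc(ys, ss) for g, (ys, ss) in by_group.items()}


# Toy data (hypothetical): binary disease labels, model scores, subgroup tags.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.3, 0.4, 0.6, 0.7, 0.5]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

overall = auc(labels, scores)
per_group = subgroup_aucs(labels, scores, groups)
gap = max(per_group.values()) - min(per_group.values())
print(overall, per_group, gap)
```

In this toy example the overall AUC looks reasonable while one subgroup performs at chance level, which is the pattern a single aggregate metric can hide.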

Paper Structure (19 sections, 7 figures, 5 tables)

Figures (7)

  • Figure 1: For some subgroupings, the baseline model shows large disparities in test set AUC between worst- and best-performing subgroups, far below and above the average AUC of 0.71. Error bars represent standard deviation across the three random seeds.
  • Figure 2: Kernel density estimation of the first 4 principal components (PC) of the features extracted from the baseline model's penultimate layer, grouped by centre. Table of mean Wasserstein distance of features between one centre and the other 5 for the 3 random seeds. Centre f's feature distribution is clearly an outlier across some PCs.
  • Figure 3: Overall AUC of age mitigation models (left) and centre mitigation models (right) relative to worst-group AUC. Most models worsen both overall and minimum performance relative to the baseline (red point), especially for age mitigation. Error bars represent standard deviation for 3 random seeds.
  • Figure A1: Baseline data characteristics and SBP distribution.
  • Figure A2: Age (top) and centre (bottom) AUC evolution during a baseline training run.
  • ...and 2 more figures
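
The Figure 2 caption refers to a mean Wasserstein distance between one centre's features and those of the other centres. The sketch below shows one way such a quantity could be computed, under stated assumptions: it works on a single one-dimensional projection (for example, one principal component), it averages the pairwise distances to each other centre, and the centre labels and values are invented toy data. The caption does not specify these details, so this is not necessarily the authors' procedure.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two empirical 1D samples.

    Computed as the integral of |F_x(t) - F_y(t)| over t, where F is the
    empirical cumulative distribution function of each sample.
    """
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    dist = 0.0
    i = j = 0  # number of samples <= current left endpoint
    for left, right in zip(points, points[1:]):
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        dist += abs(i / len(xs) - j / len(ys)) * (right - left)
    return dist


def mean_distance_to_others(features_by_centre, centre):
    """Average pairwise distance from one centre to every other centre (an assumption)."""
    others = [c for c in features_by_centre if c != centre]
    total = sum(
        wasserstein_1d(features_by_centre[centre], features_by_centre[c])
        for c in others
    )
    return total / len(others)


# Toy 1D feature projections per centre (hypothetical values and labels).
features = {"a": [0.0, 1.0], "b": [0.0, 1.0], "f": [2.0, 3.0]}
dist_f = mean_distance_to_others(features, "f")
dist_a = mean_distance_to_others(features, "a")
print(dist_f, dist_a)
```

A centre whose mean distance is much larger than the others', as with the toy centre "f" here, would stand out as a distributional outlier in the sense the caption describes.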