Evaluating the Fairness of Neural Collapse in Medical Image Classification

Kaouther Mouheb, Marawan Elbatel, Stefan Klein, Esther E. Bron

TL;DR

This study investigates deep learning fairness through the lens of Neural Collapse (NC). It analyzes the training dynamics of models as they approach NC when trained on biased datasets and examines the subsequent impact on test performance, focusing on label bias.

Abstract

Deep learning has achieved impressive performance across various medical imaging tasks. However, its inherent bias against specific groups hinders its clinical applicability in equitable healthcare systems. A recently discovered phenomenon, Neural Collapse (NC), has shown potential in improving the generalization of state-of-the-art deep learning models. Nonetheless, its implications for bias in medical imaging remain unexplored. Our study investigates deep learning fairness through the lens of NC. We analyze the training dynamics of models as they approach NC when trained on biased datasets, and examine the subsequent impact on test performance, specifically focusing on label bias. We find that biased training initially results in different NC configurations across subgroups, before converging to a final NC solution by memorizing all data samples. Through extensive experiments on three medical imaging datasets (PAPILA, HAM10000, and CheXpert), we find that in biased settings, NC can lead to a significant drop in F1 score across all subgroups. Our code is available at https://gitlab.com/radiology/neuro/neural-collapse-fairness

Paper Structure (11 sections, 5 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: A 2D example of variability collapse under label noise. The crosses (x) are positive samples (+) from Group 1 (orange) that are mistakenly labeled as negative samples (-). In early training stages, the majority of them are close to the positive class mean (right arrows), leading to poor train NC but high performance on unbiased data. The final phase of training drives all noisy samples closer to the negative class mean (left arrows), leading to an optimal train collapse but a drop in test performance (Colored figure available online).
  • Figure 2: NC1 metric per epoch for each dataset-attribute combination. Biased training (solid orange line) exhibits higher initial NC1 values and slower convergence to NC compared to unbiased training (dashed blue line). Shaded areas represent the standard deviation across 10 random seeds.
  • Figure 3: AUC of the SPLIT test for sensitive information encoded in extracted features against subgroup separability of the raw data. While data points in early-stage training (a) lie on the y=x line, they no longer do in the final stage of training (b), indicating that models closer to NC remove group information. Error bars represent the standard deviation across 10 random seeds.
  • Figure 4: Test-time differences in NC1 and F1 scores between biased and unbiased models. A positive value of $\Delta$NC1 means the biased model exhibits a higher NC1, reflecting a worse test NC. A negative value of $\Delta$F1 indicates that the biased model achieves a worse F1 score than the model trained with clean data. The * denotes a statistically significant difference in F1 score.
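
The NC1 metric plotted in Figures 2 and 4 is not defined on this page. The following is a minimal sketch, assuming the formulation commonly used in the NC literature, NC1 = tr(Σ_W Σ_B^†) / C, where Σ_W is the within-class covariance of the penultimate-layer features, Σ_B is the between-class covariance of the class means, and C is the number of classes. The paper's exact definition may differ. The function name `nc1`, the synthetic 2D features, and the 30% flip rate are illustrative assumptions, chosen to mimic the label-noise situation of Figure 1.

```python
import numpy as np

def nc1(features, labels):
    """Within-class variability collapse: tr(Sigma_W @ pinv(Sigma_B)) / C.

    Assumed formulation (common in the NC literature); lower means the
    features are more tightly collapsed onto their class means.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    global_mean = features.mean(axis=0)
    d = features.shape[1]
    sigma_w = np.zeros((d, d))
    sigma_b = np.zeros((d, d))
    for c in classes:
        class_feats = features[labels == c]
        class_mean = class_feats.mean(axis=0)
        centered = class_feats - class_mean
        sigma_w += centered.T @ centered              # scatter around the class mean
        diff = (class_mean - global_mean)[:, None]
        sigma_b += diff @ diff.T                      # scatter of class means
    sigma_w /= len(features)
    sigma_b /= len(classes)
    # Sigma_B has rank at most C - 1, so a pseudo-inverse is required.
    # The explicit rcond guards against inverting round-off-level directions.
    sigma_b_pinv = np.linalg.pinv(sigma_b, rcond=1e-10)
    return float(np.trace(sigma_w @ sigma_b_pinv) / len(classes))

# Synthetic early-training features: two well-separated clusters in 2D.
rng = np.random.default_rng(0)
neg = rng.normal([-2.0, 0.0], 0.3, size=(100, 2))
pos = rng.normal([2.0, 0.0], 0.3, size=(100, 2))
features = np.vstack([neg, pos])

clean_labels = np.array([0] * 100 + [1] * 100)
noisy_labels = clean_labels.copy()
noisy_labels[100:130] = 0  # hypothetical label bias: 30 positives labeled negative

nc1_clean = nc1(features, clean_labels)
nc1_noisy = nc1(features, noisy_labels)
print(f"NC1 (clean labels): {nc1_clean:.4f}")
print(f"NC1 (noisy labels): {nc1_noisy:.4f}")
```

Measured against the noisy labels, the mislabeled samples still sit near the positive class mean, so the "negative" class is spread over two clusters and NC1 is higher. This corresponds to the poor train NC in early training described in Figure 1. A $\Delta$NC1 as in Figure 4 would then be the difference between the values obtained for the biased and unbiased models.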