Dataset Distribution Impacts Model Fairness: Single vs. Multi-Task Learning

Ralf Raumanns; Gerard Schouten; Josien P. W. Pluim; Veronika Cheplygina

TL;DR

This paper addresses how dataset bias, particularly the sex distribution of patients, affects fairness in skin lesion classification using CNNs. It introduces a linear programming (LP) approach to curate ISIC-derived datasets with controlled sex and age distributions and evaluates three learning strategies (single-task, reinforcing multi-task, and adversarial debiasing) on ResNet50 with losses $L_c$ and $L_{br}$ (with $\lambda=5$). Key findings show substantial sex bias in the base (single-task) model, limited bias mitigation by the reinforcing model, notable bias reduction by adversarial debiasing in cases involving only female patients, and improved male-subgroup performance when the training data include male patients; the best results for each sex occur when the model is trained on data from that sex. The study highlights the challenge of achieving consistent fairness under skewed distributions and provides a general, reproducible LP-based dataset construction framework, with data and code available on GitHub.

Abstract

The influence of bias in datasets on the fairness of model predictions is a topic of ongoing research in various fields. We evaluate the performance of skin lesion classification using ResNet-based CNNs, focusing on patient sex variations in training data and three different learning strategies. We present a linear programming method for generating datasets with varying patient sex and class labels, taking into account the correlations between these variables. We evaluated the model performance using three different learning strategies: a single-task model, a reinforcing multi-task model, and an adversarial learning scheme. Our observations include: 1) sex-specific training data yields better results, 2) single-task models exhibit sex bias, 3) the reinforcement approach does not remove sex bias, 4) the adversarial model eliminates sex bias in cases involving only female patients, and 5) datasets that include male patients enhance model performance for the male subgroup, even when female patients are the majority. To generalise these findings, in future research, we will examine more demographic attributes, like age, and other possibly confounding factors, such as skin colour and artefacts in the skin lesions. We make all data and models available on GitHub.
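
The dataset-construction idea in the abstract can be illustrated with a small linear program. The sketch below is not the authors' formulation: it assumes only two sexes and two class labels, and the function name `plan_split`, the cell counts, and the two target fractions are hypothetical. It picks how many images to draw from each (sex, class) cell so that the dataset is as large as possible while the female share and the per-sex malignant prevalence hit fixed targets, which keeps sex and class label uncorrelated in the sampled set.

```python
import numpy as np
from scipy.optimize import linprog

# Order of the decision variables: one count per (sex, class) cell.
CELLS = [("F", "benign"), ("F", "malignant"), ("M", "benign"), ("M", "malignant")]


def plan_split(available, female_frac, malignant_frac):
    """Return the number of images to sample per cell (continuous LP solution).

    available      -- dict mapping each cell in CELLS to the images on hand (hypothetical)
    female_frac    -- target share of female patients in the dataset
    malignant_frac -- target malignant prevalence, identical for both sexes
    """
    f, p = female_frac, malignant_frac
    cost = -np.ones(len(CELLS))  # linprog minimises, so negate to maximise size
    A_eq = np.array([
        [1 - f, 1 - f, -f, -f],  # female total = f * overall total
        [-p, 1 - p, 0, 0],       # malignant share among females = p
        [0, 0, -p, 1 - p],       # malignant share among males = p
    ])
    b_eq = np.zeros(3)
    bounds = [(0, available[cell]) for cell in CELLS]  # cannot exceed supply
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    if not res.success:
        raise ValueError(res.message)
    return dict(zip(CELLS, res.x))


# Hypothetical supply of images per cell, for illustration only.
available = {
    ("F", "benign"): 4000, ("F", "malignant"): 600,
    ("M", "benign"): 5000, ("M", "malignant"): 900,
}
plan = plan_split(available, female_frac=0.75, malignant_frac=0.10)
```

The solution is continuous, so in practice the counts would be rounded down before sampling. Sweeping `female_frac` from 0 to 1 would give a series of splits from male-only to female-only under these assumptions.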

Paper Structure (6 sections, 16 equations, 2 figures, 2 tables)

Introduction
Methods
Results
Discussion and conclusions
Acknowledgments
Disclosure of Interests

Figures (2)

Figure 1: Steps for filtering lesions and creating test, training and validation sets. Steps 5 through 7 are repeated using 5 different seeds in a cross-validation setup.
Figure 2: The AUC score varies based on data splits ranging from only male patients (M100) to only female patients (F100). We show base, reinforcing and adversarial model performance for female and male patient subgroups. Significance per Mann–Whitney U test (as used in Larrazabal et al., 2020) is denoted by **** $(P \leq 0.0001)$, *** $(0.0001 < P \leq 0.001)$, ** $(0.001 < P \leq 0.01)$, * $(0.01 < P \leq 0.1)$, and not significant (ns) $(P > 0.1)$. The symbol < indicates lower AUCs, > higher AUCs, and = comparable AUCs for female patients.
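
The significance coding in the Figure 2 caption can be reproduced with a short helper. This is a minimal sketch rather than the authors' evaluation code: the function names are hypothetical, the AUC values are made up for illustration, and a two-sided test is assumed because the caption does not state the alternative hypothesis. The thresholds are copied from the caption.

```python
from scipy.stats import mannwhitneyu


def significance_label(p):
    """Map a p-value to the star notation used in the Figure 2 caption."""
    if p <= 0.0001:
        return "****"
    if p <= 0.001:
        return "***"
    if p <= 0.01:
        return "**"
    if p <= 0.1:
        return "*"
    return "ns"


def compare_subgroups(auc_female, auc_male):
    """Mann-Whitney U test on two sets of per-run AUC scores.

    Returns the p-value and its star label. The two-sided alternative is an
    assumption, not something the caption specifies.
    """
    result = mannwhitneyu(auc_female, auc_male, alternative="two-sided")
    return result.pvalue, significance_label(result.pvalue)


# Hypothetical AUCs from five runs per subgroup, for illustration only.
auc_female = [0.86, 0.87, 0.88, 0.89, 0.90]
auc_male = [0.80, 0.81, 0.82, 0.83, 0.84]
p_value, label = compare_subgroups(auc_female, auc_male)
```

With only a handful of runs per subgroup, the smallest attainable p-value is bounded by the sample sizes, so the stronger star levels may be unreachable regardless of the effect size.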
