Unsupervised Search for Ethnic Minorities' Medical Segmentation Training Set

Yixiao Chen; Yue Yao; Ruining Yang; Md Zakir Hossain; Ashu Gupta; Tom Gedeon

TL;DR

Problem: racial bias in medical image segmentation stems from demographic imbalances in data collection, which limits fairness across minority groups. Approach: an unsupervised training-set search constructs a minority-aligned subset by clustering the source data into $K$ groups $S_1, \dots, S_K$, evaluating $\text{FID}(T, S_k)$ between the target set $T$ and each cluster, and selecting samples via weights derived from $-\text{FID}$; a minority-specific segmentation model (SAMed with LoRA) is then trained on the selected subset. Findings: greedy subset selection reduces the domain gap and improves Dice/IoU for minority groups on the FairSeg SLO fundus dataset, as validated against random sampling, with a total search cost of around 900 seconds. Significance: this label-free data curation strategy supports fairer clinical AI outcomes and could generalize to other medical imaging tasks, underscoring the importance of diverse, representative data sources and targeted model adaptation.
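The three steps named in the summary (cluster, score each cluster by FID against the target, sample by weights derived from $-\text{FID}$) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the inputs are assumed to be precomputed image feature vectors, the tiny `kmeans` helper stands in for whatever clustering method the paper uses, and the softmax weighting with a `temperature` parameter plus the quota-then-greedy fill are hypothetical choices for turning $-\text{FID}$ into a selection. The `fid` function is the standard Fréchet distance between two Gaussians fitted to the features.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Tiny Lloyd's k-means (illustrative stand-in); one label per row of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = x[labels == j].mean(axis=0)
    return labels


def fid(a, b):
    """Frechet distance between Gaussians fitted to feature sets a and b."""
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    cov_a = np.cov(a, rowvar=False)
    cov_b = np.cov(b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) is the sum of square roots of the product's eigenvalues.
    eig = np.linalg.eigvals(cov_a @ cov_b).real
    tr_sqrt = np.sqrt(np.clip(eig, 0.0, None)).sum()
    mean_term = ((mu_a - mu_b) ** 2).sum()
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


def search_training_set(source, target, k, budget, temperature=1.0, seed=0):
    """Return indices into `source` chosen to resemble `target` (no labels used)."""
    rng = np.random.default_rng(seed)
    labels = kmeans(source, k, seed=seed)

    # Step 2: distributional difference between each cluster and the target.
    fids = np.full(k, np.inf)
    for j in range(k):
        if (labels == j).sum() >= 2:  # a covariance needs at least 2 samples
            fids[j] = fid(source[labels == j], target)

    # Step 3 (assumed form): softmax over -FID gives per-cluster sampling scores.
    logits = -fids / temperature
    scores = np.exp(logits - logits.max())
    scores /= scores.sum()

    chosen = []
    order = np.argsort(-scores)
    # Pass 1: each cluster contributes a quota proportional to its score.
    for j in order:
        members = rng.permutation(np.flatnonzero(labels == j))
        chosen.extend(members[: int(scores[j] * budget)])
    # Pass 2: fill any remaining budget greedily from the closest clusters.
    taken = set(chosen)
    for j in order:
        for i in np.flatnonzero(labels == j):
            if len(chosen) >= budget:
                break
            if i not in taken:
                chosen.append(i)
                taken.add(i)
    return np.array(chosen[:budget], dtype=int), fids, scores
```

Because FID values can differ by orders of magnitude between clusters, an untempered softmax tends to put almost all of the weight on the closest cluster; raising the hypothetical `temperature` spreads the selection across more clusters.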

Abstract

This article investigates the critical issue of dataset bias in medical imaging, with a particular emphasis on racial disparities caused by uneven population distribution in dataset collection. Our analysis reveals that medical segmentation datasets are significantly biased, primarily because of the demographic composition of their collection sites. For instance, Scanning Laser Ophthalmoscopy (SLO) fundus datasets collected in the United States predominantly feature images of White individuals, with minority racial groups underrepresented. This imbalance can result in biased model performance and inequitable clinical outcomes, particularly for minority populations. To address this challenge, we propose a novel training set search strategy aimed at reducing these biases by focusing on underrepresented racial groups. Our approach utilizes existing datasets and employs a simple greedy algorithm to identify source images that closely match the target domain distribution. By selecting training data that aligns more closely with the characteristics of minority populations, our strategy improves the accuracy of medical segmentation models for specific minority groups, e.g., Black patients. Our experimental results demonstrate the effectiveness of this approach in mitigating bias. We also discuss the broader societal implications, highlighting how addressing these disparities can contribute to more equitable healthcare outcomes.

Paper Structure (11 sections, 2 equations, 4 figures, 1 table, 1 algorithm)

Introduction
Related Work
Method
Motivation
The Minority-specific Search Algorithm
The Minority-specific Segmentation Model
Experimental
Settings
Results
Discussion
Conclusion

Figures (4)

Figure 1: Image samples and race composition statistics of the SLO fundus dataset. Left columns present the target samples, while the pie chart illustrates the unbalanced distribution across races.
Figure 2: The proposed minority-specific training set search algorithm consists of three steps. First, the data pool is partitioned into K clusters. Second, the distributional differences between the clusters and the target are computed. Third, sampling scores are calculated, and the searched training set is constructed based on these scores.
Figure 3: SLO fundus segmentation task using SAMed [zhang2023customized], illustrating the segmentation of the optic cup and disc with the corresponding masks applied.
Figure 4: Impact of the number of clusters $K$ on the domain gap between the searched training set and the target set.
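The optic cup and disc masks shown in Figure 3 are scored with Dice and IoU. As a rough illustration of how these two overlap metrics relate, the sketch below computes both for a single pair of binary masks. The function name `dice_and_iou` and the smoothing term `eps` are illustrative assumptions, not taken from the paper or from FairSeg's evaluation code.

```python
import numpy as np


def dice_and_iou(pred, truth, eps=1e-7):
    """Dice and IoU for one pair of binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    # eps keeps both ratios defined when the two masks are empty.
    dice = (2.0 * inter + eps) / (pred.sum() + truth.sum() + eps)
    iou = (inter + eps) / (union + eps)
    return float(dice), float(iou)
```

For a multi-class label map, one would typically call this once per structure, for example on `pred == c` and `truth == c` for each hypothetical class id `c`, and average the results.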
