Who Does Your Algorithm Fail? Investigating Age and Ethnic Bias in the MAMA-MIA Dataset
Aditya Parikh, Sneha Das, Aasa Feragen
TL;DR
This study audits fairness in automated breast tumor segmentation using the multi-center MAMA-MIA dataset, revealing an intrinsic age-related bias against younger patients and strong, data-source-dependent ethnic biases that are masked when data are pooled. By comparing model-generated (silver) masks against expert (gold) labels using metrics such as Dice and HD95, and by applying demographic disparity measures, the authors show that aggregation can obscure group-level disparities and that the age bias persists even with age-balanced training. The work contributes a systematic framework for fairness auditing in medical image segmentation, underscores the influence of center-specific factors, and calls for targeted investigations into annotation quality and mitigation strategies to improve equitable clinical outcomes.
Abstract
Deep learning models are intended to improve diagnostic workflows, but fairness evaluation remains underexplored beyond classification, e.g., in image segmentation. Unaddressed segmentation bias can lead to disparities in the quality of care for certain populations, potentially compounded across clinical decision points and amplified through iterative model development. Here, we audit the fairness of the automated segmentation labels provided in the breast cancer tumor segmentation dataset MAMA-MIA. We evaluate automated segmentation quality across age, ethnicity, and data source. Our analysis reveals an intrinsic age-related bias against younger patients that persists even after controlling for confounding factors, such as data source. We hypothesize that this bias may be linked to physiological factors, a known challenge for both radiologists and automated systems. Finally, we show how aggregating data from multiple data sources influences site-specific ethnic biases, underscoring the necessity of investigating data at a granular level.
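As a rough illustration of the kind of audit described above, the following minimal Python sketch computes a Dice overlap between an automated mask and an expert mask, then compares a group disparity computed on pooled data with the same disparity computed per data source. The function names (`dice`, `group_gap`), the max-minus-min gap as the disparity measure, and the toy scores are all assumptions made for illustration; they are not the paper's implementation or results.

```python
import numpy as np

def dice(pred, ref):
    """Dice overlap between two binary masks (defined as 1.0 when both are empty)."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    total = pred.sum() + ref.sum()
    if total == 0:
        return 1.0
    return float(2.0 * np.logical_and(pred, ref).sum() / total)

def group_gap(records, site=None):
    """Gap between the best and worst group mean score.

    records: iterable of (site, group, score) tuples.
    site: if given, restrict to that data source; otherwise pool all sources.
    The max-minus-min gap is an assumed, simple disparity measure.
    """
    by_group = {}
    for s, g, score in records:
        if site is None or s == site:
            by_group.setdefault(g, []).append(score)
    means = {g: sum(v) / len(v) for g, v in by_group.items()}
    return max(means.values()) - min(means.values())

# Hypothetical toy scores (not from MAMA-MIA): the two groups swap
# rank between sites, so the pooled group means coincide.
records = [
    ("site_1", "A", 0.9), ("site_1", "B", 0.7),
    ("site_2", "A", 0.7), ("site_2", "B", 0.9),
]

pooled = group_gap(records)
per_site = {s: group_gap(records, site=s) for s in ("site_1", "site_2")}
print(f"pooled gap: {pooled:.2f}")
for s, gap in per_site.items():
    print(f"{s} gap: {gap:.2f}")
```

In this toy setup the pooled gap vanishes while each site shows a clear gap, which is the masking effect that motivates reporting disparities per data source as well as in aggregate.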
