Investigating Label Bias and Representational Sources of Age-Related Disparities in Medical Segmentation

Aditya Parikh, Sneha Das, Aasa Feragen

TL;DR

The study investigates age-related disparities in breast cancer segmentation on the MAMA-MIA dataset and separates label bias from representational bias. Using a bias-diagnosis framework and controlled experiments, it shows that systematically flawed automated labels used for validation misrepresent a model's actual bias (the Biased Ruler effect), and that representational differences, such as larger, more variable tumors in younger patients, make those cases intrinsically harder to learn. Balancing training data by difficulty or substituting high-quality labels does not eliminate the disparity, while training on biased, machine-generated labels amplifies it. This points to the need for interventions that address qualitative distributional differences and for rigorous auditing of automated annotation pipelines. Practically, the work offers a framework for diagnosing segmentation bias and argues that fairness cannot be achieved by balancing case counts alone, with implications for clinical deployment and regulatory guidelines.

Abstract

Algorithmic bias in medical imaging can perpetuate health disparities, yet its causes remain poorly understood in segmentation tasks. While fairness has been extensively studied in classification, segmentation remains underexplored despite its clinical importance. In breast cancer segmentation, models exhibit significant performance disparities against younger patients, commonly attributed to physiological differences in breast density. We audit the MAMA-MIA dataset, establishing a quantitative baseline of age-related bias in its automated labels, and reveal a critical Biased Ruler effect where systematically flawed labels for validation misrepresent a model's actual bias. However, whether this bias originates from lower-quality annotations (label bias) or from fundamentally more challenging image characteristics remains unclear. Through controlled experiments, we systematically refute hypotheses that the bias stems from label quality sensitivity or quantitative case difficulty imbalance. Balancing training data by difficulty fails to mitigate the disparity, revealing that younger patient cases are intrinsically harder to learn. We provide direct evidence that systemic bias is learned and amplified when training on biased, machine-generated labels, a critical finding for automated annotation pipelines. This work introduces a systematic framework for diagnosing algorithmic bias in medical segmentation and demonstrates that achieving fairness requires addressing qualitative distributional differences rather than merely balancing case counts.
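
The audit step described in the abstract, scoring each case with the Dice coefficient and regressing that score on patient age, can be illustrated with a minimal pure-Python sketch. The Dice formula is the standard one. The function names, the empty-mask convention, and the toy ages and scores are assumptions made for illustration and are not taken from the authors' pipeline or from MAMA-MIA.

```python
def dice_score(pred, target):
    """Dice = 2|A and B| / (|A| + |B|) for flat binary masks (lists of 0/1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


def ols_fit(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy per-case data, purely illustrative (hypothetical values).
ages = [30, 40, 50, 60, 70]
dice = [0.60, 0.65, 0.70, 0.75, 0.80]

slope, intercept = ols_fit(ages, dice)
print(f"Dice change per year of age: {slope:.4f}")
```

A positive slope in such a fit means that segmentation quality rises with age, which is the direction of disparity the paper reports for younger patients.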

Paper Structure

This paper contains 7 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Inherent age-related bias in the Silver-Standard automated labels. An OLS regression reveals a positive correlation between patient age and segmentation performance (Dice Score), establishing a quantitative baseline of disparity.
  • Figure 2: (a) Tumor volume is, on average, larger and has higher variance in the Young group ($p<0.01$ for Y-O). (b) In contrast, basic tumor shape metrics (sphericity and elongation) show no statistically significant difference.
  • Figure 3: Inspection of subgroup distribution shifts in the feature space projection, using t-SNE (t-distributed stochastic neighbor embedding; van der Maaten and Hinton, 2008) embeddings for representative fold 0. Left: Scatter plot of the first two t-SNE dimensions. Right: Density distribution of the first t-SNE dimension. Both plots indicate a strong overlap, suggesting the model’s latent space does not strongly separate representations by age. Clustering metrics quantify this overlap across 5 folds (Mean$\pm$Std): For t-SNE, Silhouette = $-0.0312 \pm 0.0112$, Purity = $0.3943 \pm 0.0193$, ARI (Adjusted Rand Index; [-1, 1], higher is better agreement) = $0.0033 \pm 0.0113$, NMI (Normalized Mutual Information; [0, 1], higher is better agreement) = $0.0126 \pm 0.0098$. Low ARI and NMI values confirm poor correspondence between the embedding structure and true age groups. t-SNE parameters: perplexity $\approx n/10$ (clipped to 5–50), learning rate 200, 1500 iterations, init = PCA.
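
Two quantities in the Figure 3 caption are simple enough to sketch directly: the perplexity rule ($n/10$, clipped to 5–50) and cluster purity. The pure-Python sketch below is illustrative only. The function names, the toy cluster assignments, and the age-group labels are hypothetical, and the caption does not say how cluster assignments were obtained, so that step is left out.

```python
from collections import Counter


def tsne_perplexity(n_samples, low=5.0, high=50.0):
    """Perplexity of about n/10, clipped to [low, high] as in the caption."""
    return max(low, min(high, n_samples / 10.0))


def cluster_purity(cluster_ids, group_labels):
    """Share of points whose group equals the majority group of their cluster."""
    counts_by_cluster = {}
    for cluster, group in zip(cluster_ids, group_labels):
        counts_by_cluster.setdefault(cluster, Counter())[group] += 1
    majority_total = sum(
        counts.most_common(1)[0][1] for counts in counts_by_cluster.values()
    )
    return majority_total / len(group_labels)


# Hypothetical toy example: clusters that ignore age group entirely.
clusters = [0, 0, 0, 1, 1, 1]
groups = ["young", "middle", "old", "young", "middle", "old"]
mixed_purity = cluster_purity(clusters, groups)

# Hypothetical toy example: clusters that match age group exactly.
aligned_purity = cluster_purity([0, 0, 1, 1], ["young", "young", "old", "old"])

print(tsne_perplexity(300), mixed_purity, aligned_purity)
```

Purity near the share of the largest group, as in the mixed toy case, indicates that clusters carry little information about age group. The other reported metrics are available in scikit-learn as `silhouette_score`, `adjusted_rand_score`, and `normalized_mutual_info_score` in `sklearn.metrics`.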