The MAMA-MIA Challenge: Advancing Generalizability and Fairness in Breast MRI Tumor Segmentation and Treatment Response Prediction

Lidia Garrucho, Smriti Joshi, Kaisar Kushibar, Richard Osuala, Maciej Bobowicz, Xavier Bargalló, Paulius Jaruševičius, Kai Geissler, Raphael Schäfer, Muhammad Alberb, Tony Xu, Anne Martel, Daniel Sleiman, Navchetan Awasthi, Hadeel Awwad, Joan C. Vilanova, Robert Martí, Daan Schouten, Jeong Hoon Lee, Mirabela Rusu, Eleonora Poeta, Luisa Vargas, Eliana Pastor, Maria A. Zuluaga, Jessica Kächele, Dimitrios Bounias, Alexandra Ertl, Katarzyna Gwoździewicz, Maria-Laura Cosaka, Pasant M. Abo-Elhoda, Sara W. Tantawy, Shorouq S. Sakrana, Norhan O. Shawky-Abdelfatah, Amr Muhammad Abdo-Salem, Androniki Kozana, Eugen Divjak, Gordana Ivanac, Katerina Nikiforaki, Michail E. Klontzas, Rosa García-Dosdá, Meltem Gulsun-Akpinar, Oğuz Lafcı, Carlos Martín-Isla, Oliver Díaz, Laura Igual, Karim Lekadir

TL;DR

The MAMA-MIA Challenge provides standardized datasets, evaluation protocols, and public resources to promote the development of robust and equitable artificial intelligence systems for breast cancer imaging. Its results demonstrate substantial performance variability under external testing.

Abstract

Breast cancer is the most frequently diagnosed malignancy among women worldwide and a leading cause of cancer-related mortality. Dynamic contrast-enhanced magnetic resonance imaging plays a central role in tumor characterization and treatment monitoring, particularly in patients receiving neoadjuvant chemotherapy. However, existing artificial intelligence models for breast magnetic resonance imaging are often developed using single-center data and evaluated using aggregate performance metrics, limiting their generalizability and obscuring potential performance disparities across demographic subgroups. The MAMA-MIA Challenge was designed to address these limitations by introducing a large-scale benchmark that jointly evaluates primary tumor segmentation and prediction of pathologic complete response using pre-treatment magnetic resonance imaging only. The training cohort comprised 1,506 patients from multiple institutions in the United States, while evaluation was conducted on an external test set of 574 patients from three independent European centers to assess cross-continental and cross-institutional generalization. A unified scoring framework combined predictive performance with subgroup consistency across age, menopausal status, and breast density. Twenty-six international teams participated in the final evaluation phase. Results demonstrate substantial performance variability under external testing and reveal trade-offs between overall accuracy and subgroup fairness. The challenge provides standardized datasets, evaluation protocols, and public resources to promote the development of robust and equitable artificial intelligence systems for breast cancer imaging.

Paper Structure (33 sections, 7 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Fairness plotted against performance across the two MAMA-MIA benchmark tasks. Teams with fairness-performance tradeoffs within the area indicated in green improve upon the baseline tradeoff.
  • Figure 2: Distribution of DSC across the three participating centers (GUM, KAU, HCB) for the top 5 performing teams in Task 1. Each boxplot represents the per-case DSC distribution for a team, in order of their ranking. Horizontal dashed lines indicate the average DSC across all top 5 teams for each center.
  • Figure 3: Analysis of tumor size distribution and its impact on segmentation performance. (a) Histogram of tumor volumes ($\mathrm{mm}^3$) on a logarithmic scale, with dashed lines indicating the 20th and 80th percentile thresholds used to categorize tumors as Small, Moderate, or Large. (b) Percentage distribution of tumor size categories across the three participating centers (GUM, KAU, HCB), with boxed values showing the average Dice Similarity Coefficient (DSC) for the top 5 teams in each category. (c) Comparison of DSC distributions between the top 5 and bottom 5 performing teams across tumor size categories, with gap bars ($\Delta$) indicating the difference in mean DSC.
  • Figure 4: Subgroup analysis of segmentation performance. Boxplots show DSC distributions for the top 5 (blue) and bottom 5 (red) performing teams, stratified by (a) breast density, (b) age group, and (c) menopausal status. High-ranking methods consistently outperform low-ranking methods, and there are no substantial differences between subgroups in any fairness category.
  • Figure 5: Recall comparison between pCR (minority) and not pCR (majority) classes across participating teams. $\Delta$ denotes the recall gap, reflecting class-wise prediction bias.
  • ...and 3 more figures
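
The figure captions above rely on three quantities: the Dice Similarity Coefficient (DSC), tumor size categories cut at the 20th and 80th volume percentiles, and the recall gap between the pCR and non-pCR classes. The following Python sketch illustrates how such quantities are commonly computed. It is not the challenge's evaluation code: the function names, the flattened-mask representation, the linear-interpolation percentile, the handling of values exactly at a threshold, and the use of an absolute difference for the recall gap are all assumptions made for illustration.

```python
from typing import Sequence


def dice(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice Similarity Coefficient for two flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0 to 100) using linear interpolation between ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def size_category(volume: float, p20: float, p80: float) -> str:
    """Label a tumor volume relative to the 20th and 80th percentile cuts.

    Assumption: volumes equal to a threshold fall into "Moderate".
    """
    if volume < p20:
        return "Small"
    if volume > p80:
        return "Large"
    return "Moderate"


def recall(y_true: Sequence[int], y_pred: Sequence[int], label: int) -> float:
    """Fraction of cases with true class `label` that were predicted as `label`."""
    preds_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    if not preds_for_label:
        raise ValueError(f"no cases with true label {label}")
    return sum(1 for p in preds_for_label if p == label) / len(preds_for_label)


def recall_gap(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Absolute recall difference between non-pCR (0) and pCR (1) classes.

    Assumption: the gap is taken as an absolute value; the source only
    states that the delta reflects class-wise prediction bias.
    """
    return abs(recall(y_true, y_pred, 0) - recall(y_true, y_pred, 1))
```

As a usage example with made-up labels, `recall_gap([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])` compares a pCR recall of 1/2 with a non-pCR recall of 3/4. In practice the percentile thresholds would be computed once over all tumor volumes in the test set and then passed to `size_category` for each case.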