Medical Imaging AI Competitions Lack Fairness

Annika Reinke, Evangelia Christodoulou, Sthuthi Sadananda, A. Emre Kavur, Khrystyna Faryna, Daan Schouten, Bennett A. Landman, Carole Sudre, Olivier Colliot, Nick Heller, Sophie Loizillon, Martin Maška, Maëlys Solal, Arya Yazdan-Panah, Vilma Bozgo, Ömer Sümer, Siem de Jong, Sophie Fischer, Michal Kozubek, Tim Rädsch, Nadim Hammoud, Fruzsina Molnár-Gábor, Steven Hicks, Michael A. Riegler, Anindo Saha, Vajira Thambawita, Pal Halvorsen, Amelia Jiménez-Sánchez, Qingyang Yang, Veronika Cheplygina, Sabrina Bottazzi, Alexander Seitel, Spyridon Bakas, Alexandros Karargyris, Kiran Vaidhya Venkadesh, Bram van Ginneken, Lena Maier-Hein

TL;DR

This study examines fairness in biomedical imaging AI benchmarking along two dimensions: how representative challenge datasets are, and how accessible and reusable they are under the FAIR principles. Through a large-scale analysis of 241 challenges (458 tasks) across 19 imaging modalities, it identifies substantial geographic, modality, and task biases, along with pervasive licensing ambiguities and documentation gaps that impede reproducibility. The authors argue that benchmark success often misaligns with clinical relevance because of these gaps and propose steps toward machine-readable metadata, standardized licensing, and baseline data governance. Overall, the work highlights systemic weaknesses in current challenges and calls for standards to ensure that benchmarks translate meaningfully to real-world clinical practice.

Abstract

Benchmarking competitions are central to the development of artificial intelligence (AI) in medical imaging, defining performance standards and shaping methodological progress. However, it remains unclear whether these benchmarks provide data that are sufficiently representative, accessible, and reusable to support clinically meaningful AI. In this work, we assess fairness along two complementary dimensions: (1) whether challenge datasets are representative of real-world clinical diversity, and (2) whether they are accessible and legally reusable in line with the FAIR principles. To address this question, we conducted a large-scale systematic study of 241 biomedical image analysis challenges comprising 458 tasks across 19 imaging modalities. Our findings show substantial biases in dataset composition, including biases related to geographic location, imaging modality, and problem type, indicating that current benchmarks do not adequately reflect real-world clinical diversity. Despite their widespread influence, challenge datasets were frequently constrained by restrictive or ambiguous access conditions, inconsistent or non-compliant licensing practices, and incomplete documentation, limiting reproducibility and long-term reuse. Together, these shortcomings expose foundational fairness limitations in our benchmarking ecosystem and highlight a disconnect between leaderboard success and clinical relevance.

Paper Structure

This paper contains 16 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview of biomedical image analysis challenges. (a) Number of challenges and tasks per year. (b) Body region. (c) Challenge edition, i.e., whether the challenge was newly introduced or an iteration of a previous version. (d) Challenge venue or conference. (e) Sample size of training and test datasets.
  • Figure 2: Medical imaging AI challenges are biased with respect to (a) geographical origin (Northern America, China, Europe), (b) problem category (segmentation), and (c) imaging modality (Magnetic Resonance Imaging; MRI).
  • Figure 3: Problematic practices in data licensing and access conditions undermine expectations associated with the FAIR principles of Accessibility and Reusability. Results are based on tasks with standard licenses or clearly interpretable licensing information (n = 398). Tasks were categorized according to observed licensing and access practices, ranging from correct implementations to unclear, misleading, and potentially non-compliant cases. Note that tasks may exhibit multiple types of issues, resulting in percentages exceeding 100%.
  • Figure 4: Reporting quality related to data aspects in medical imaging AI competitions is poor. Stacked barplots show the percentages of tasks in which key data aspects were described in sufficient detail (green), partly described (yellow), or not described at all (red).
  • Figure SN1: Distribution of data-sharing scores across tasks. Data-sharing quality was assessed on a 0–5 scale following the (Re)usable Data Project (RDP) framework (Carbon et al., 2019), where higher scores indicate clearer licensing terms, greater data accessibility, and fewer restrictions on reuse and redistribution. Bars show the number of tasks assigned to each score.