The Medical Segmentation Decathlon

Michela Antonelli, Annika Reinke, Spyridon Bakas, Keyvan Farahani, Annette Kopp-Schneider, Bennett A. Landman, Geert Litjens, Bjoern Menze, Olaf Ronneberger, Ronald M. Summers, Bram van Ginneken, Michel Bilello, Patrick Bilic, Patrick F. Christ, Richard K. G. Do, Marc J. Gollub, Stephan H. Heckers, Henkjan Huisman, William R. Jarnagin, Maureen K. McHugo, Sandy Napel, Jennifer S. Goli Pernicka, Kawal Rhode, Catalina Tobon-Gomez, Eugene Vorontsov, Henkjan Huisman, James A. Meakin, Sebastien Ourselin, Manuel Wiesenfarth, Pablo Arbelaez, Byeonguk Bae, Sihong Chen, Laura Daza, Jianjiang Feng, Baochun He, Fabian Isensee, Yuanfeng Ji, Fucang Jia, Namkug Kim, Ildoo Kim, Dorit Merhof, Akshay Pai, Beomhee Park, Mathias Perslev, Ramin Rezaiifar, Oliver Rippel, Ignacio Sarasua, Wei Shen, Jaemin Son, Christian Wachinger, Liansheng Wang, Yan Wang, Yingda Xia, Daguang Xu, Zhanwei Xu, Yefeng Zheng, Amber L. Simpson, Lena Maier-Hein, M. Jorge Cardoso

TL;DR

The paper introduces the Medical Segmentation Decathlon (MSD) as an international benchmark to test whether a single general-purpose segmentation algorithm can perform well across ten diverse tasks and modalities, addressing the need for scalable, generalizable medical image analysis. It implements a two-phase challenge (development and mystery) with ten heterogeneous datasets and evaluates generalizability using cross-task performance and significance-based rankings, monitored with bootstrapping and stability analyses. The results show that state-of-the-art segmentation methods are mature, with nnU-Net emerging as a highly generalizable learner that maintains high performance across unseen tasks and over time, corroborated by its continued success in subsequent challenges. The MSD demonstrates that automated, task-agnostic segmentation pipelines can reach top-tier performance without extensive task-specific tuning, enabling broader participation by non-AI experts and accelerating clinical translation of segmentation tools.

Abstract

International challenges have become the de facto standard for comparative assessment of image analysis algorithms given a specific task. Segmentation is so far the most widely investigated medical image processing task, but the various segmentation challenges have typically been organized in isolation, such that algorithm development was driven by the need to tackle a single specific clinical problem. We hypothesized that a method capable of performing well on multiple tasks will generalize well to a previously unseen task and potentially outperform a custom-designed solution. To investigate the hypothesis, we organized the Medical Segmentation Decathlon (MSD), a biomedical image analysis challenge in which algorithms compete in a multitude of both tasks and modalities. The underlying data set was designed to explore the axis of difficulties typically encountered when dealing with medical images, such as small data sets, unbalanced labels, multi-site data and small objects. The MSD challenge confirmed that algorithms with consistently good performance on a set of tasks preserved their good average performance on a different set of previously unseen tasks. Moreover, by monitoring the MSD winner for two years, we found that this algorithm continued generalizing well to a wide range of other clinical problems, further confirming our hypothesis. Three main conclusions can be drawn from this study: (1) state-of-the-art image segmentation algorithms are mature, accurate, and generalize well when retrained on unseen tasks; (2) consistent algorithmic performance across multiple tasks is a strong surrogate of algorithmic generalizability; (3) the training of accurate AI segmentation models is now commoditized to non-AI experts.

Paper Structure

This paper contains 42 sections, 17 figures, 13 tables.

Figures (17)

  • Figure 1: Overview of the ten different tasks of the Medical Segmentation Decathlon (MSD). The challenge comprised different target regions, modalities and challenging characteristics and was separated into seven known tasks (blue; the development phase) and three mystery tasks (gray; the mystery phase). Abbreviations: MRI, magnetic resonance imaging; mp-MRI, multiparametric magnetic resonance imaging; CT, computed tomography.
  • Figure 2: Base network architectures (left) and loss functions (right) used by the participants of the 2018 Decathlon challenge who provided full algorithmic information.
  • Figure 3: Dot- and box-plots of the DSC values of all participating algorithms for the seven tasks of the development phase, color-coded by the target regions. Box-plots represent descriptive statistics over all test cases. The median value is shown by the black horizontal line within the box, the first and third quartiles as the lower and upper border of the box, respectively, and 1.5 times the interquartile range by the vertical black lines. Outliers are shown as black circles. The raw DSC values are provided as gray circles. Abbreviations: PZ, peripheral zone; TZ, transition zone.
  • Figure 4: Dot- and box-plots of the DSC values of all participating algorithms for the three tasks of the mystery phase, color-coded by the target regions. Box-plots represent descriptive statistics over all test cases. The median value is shown by the black horizontal line within the box, the first and third quartiles as the lower and upper border of the box, respectively, and 1.5 times the interquartile range by the vertical black lines. Outliers are shown as black circles. The raw DSC values are provided as gray circles.
  • Figure 5: Box-plots of ranks for all participating algorithms over all seven tasks and thirteen target regions of the development phase (red) and all three tasks and four target regions of the mystery phase (blue). The median value is shown by the black vertical line within the box, the first and third quartiles as the lower and upper border of the box, respectively, and 1.5 times the interquartile range by the horizontal black lines. Individual ranks are shown as gray circles.
  • ...and 12 more figures
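
The captions of Figures 3 to 5 refer to DSC values and to box-plot statistics (median, quartiles, 1.5 times the interquartile range, outliers). The sketch below illustrates how such quantities could be computed. It is not the challenge's evaluation code: the function names `dice_score` and `boxplot_summary`, the flat 0/1 mask representation, the handling of two empty masks, the quartile interpolation method, and the Tukey-style whisker convention are all assumptions made for illustration.

```python
from statistics import quantiles


def dice_score(pred, ref):
    """Dice similarity coefficient (DSC) of two binary masks.

    Masks are assumed to be flat sequences of 0/1 values of equal length.
    """
    if len(pred) != len(ref):
        raise ValueError("masks must contain the same number of voxels")
    intersection = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


def boxplot_summary(values):
    """Median, quartiles, whisker ends and outliers of a list of scores.

    Whiskers follow the common Tukey convention (assumed here): they reach
    the most extreme data points within 1.5 * IQR of the box.
    """
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }


if __name__ == "__main__":
    # Hypothetical per-case DSC values for one algorithm on one target region.
    scores = [0.1, 0.8, 0.82, 0.85, 0.9]
    print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
    print(boxplot_summary(scores))
```

With the hypothetical scores above, the single low value falls outside the lower fence and is reported as an outlier, which corresponds to a black circle in the plots described by the captions.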