Unleashing the Strengths of Unlabeled Data in Pan-cancer Abdominal Organ Quantification: the FLARE22 Challenge

Jun Ma, Yao Zhang, Song Gu, Cheng Ge, Shihao Ma, Adamo Young, Cheng Zhu, Kangkang Meng, Xin Yang, Ziyan Huang, Fan Zhang, Wentao Liu, YuanKe Pan, Shoujin Huang, Jiacheng Wang, Mingze Sun, Weixin Xu, Dengqiang Jia, Jae Won Choi, Natália Alves, Bram de Wilde, Gregor Koehler, Yajun Wu, Manuel Wiesenfarth, Qiongjie Zhu, Guoqiang Dong, Jian He, the FLARE Challenge Consortium, Bo Wang

TL;DR

The FLARE 2022 Challenge benchmarked fast, low-resource, accurate, annotation-efficient, and generalized AI algorithms. Independent validation showed that a set of these algorithms achieved a median Dice Similarity Coefficient (DSC) of 90.0% using 50 labeled scans and 2000 unlabeled scans, which can significantly reduce annotation requirements.

Abstract

Quantitative organ assessment is an essential step in automated abdominal disease diagnosis and treatment planning. Artificial intelligence (AI) has shown great potential to automate this process. However, most existing AI algorithms rely on many expert annotations and lack a comprehensive evaluation of accuracy and efficiency in real-world multinational settings. To overcome these limitations, we organized the FLARE 2022 Challenge, the largest abdominal organ analysis challenge to date, to benchmark fast, low-resource, accurate, annotation-efficient, and generalized AI algorithms. We constructed an intercontinental and multinational dataset from more than 50 medical groups, comprising Computed Tomography (CT) scans that cover different races, diseases, phases, and scanner manufacturers. We independently validated that a set of AI algorithms achieved a median Dice Similarity Coefficient (DSC) of 90.0% using 50 labeled scans and 2000 unlabeled scans, which can significantly reduce annotation requirements. The best-performing algorithms generalized successfully to holdout external validation sets, achieving median DSCs of 89.5%, 90.9%, and 88.3% on North American, European, and Asian cohorts, respectively. They also enabled automatic extraction of key organ biology features, which is labor-intensive with traditional manual measurement. This opens the potential to use unlabeled data to boost performance and alleviate annotation shortages for modern AI models.
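For readers unfamiliar with the headline metric, the DSC between a predicted organ mask P and a reference mask T is 2|P ∩ T| / (|P| + |T|). The following is a minimal sketch of that formula, not the challenge's evaluation code. The function name `dice`, the flat label lists, and the convention of returning 1.0 when the organ is absent from both masks are assumptions made for illustration.

```python
from statistics import median


def dice(pred, truth, label):
    """Dice Similarity Coefficient for one organ label.

    `pred` and `truth` are flat sequences of integer voxel labels
    (an illustrative stand-in for a 3D segmentation volume).
    """
    if len(pred) != len(truth):
        raise ValueError("prediction and reference must have the same size")
    n_pred = sum(1 for v in pred if v == label)
    n_truth = sum(1 for v in truth if v == label)
    overlap = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
    if n_pred + n_truth == 0:
        # Assumption: an organ missing from both masks counts as perfect agreement.
        return 1.0
    return 2.0 * overlap / (n_pred + n_truth)


# Toy example with two organ labels (1 and 2); 0 is background.
truth = [0, 1, 1, 1, 2, 2, 0, 0]
pred = [0, 1, 1, 0, 2, 2, 2, 0]
scores = [dice(pred, truth, label) for label in (1, 2)]
median_dsc = median(scores)  # summary statistic of the kind reported above
```

On real CT volumes the same computation would run per organ and per case, with the median then taken across those scores.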

Paper Structure (19 sections, 4 equations, 4 figures, 1 table)


Figures (4)

  • Figure 1: Overview of the challenge design. a, The challenge aims to benchmark automatic algorithms that can simultaneously segment 13 abdominal organs. Organs have different sizes, morphologies, and appearances, featuring representative difficulties in medical image analysis tasks. b, The challenge contains two phases. During the development phase, participants develop automatic segmentation algorithms based on 2000 unlabeled cases and 50 labeled cases. The algorithms can be evaluated on the tuning set, and the online evaluation platform returns the quantitative performance to participants. During the validation phase, each participating team can submit one algorithm as a Docker container as its final solution, which is independently evaluated on the internal validation set to obtain ranking results. The algorithms of the top teams are selected for post-challenge analysis and are further evaluated on three independent intercontinental cohorts to validate their generalization ability. c, The data sources are multinational and the challenge has attracted more than 100 participants worldwide (the circle size is proportional to the number of participants in each country). d, The FLARE challenge dataset is significantly larger than previous abdominal organ segmentation challenge datasets. Distribution of key algorithm designs: e, network architecture, f, loss function, and g, optimizer.
  • Figure 2: Performance analysis on the internal validation set. a, Comparing the top-performing algorithms with and without unlabeled data shows that using unlabeled data can significantly improve performance. The average improvement in the Dice Similarity Coefficient (DSC) score is 9.8%. b, The top three best-performing algorithms achieve a good trade-off between segmentation accuracy (y-axis) and efficiency (x-axis). The circle size is proportional to GPU memory consumption. The 14 top-performing algorithms are marked in the figure. c, Performance comparisons across different dimensions are presented for the top three best-performing algorithms and another three top-performing algorithms with the best DSC, running time, and CPU utilization metrics, respectively. The value denotes the number of teams surpassed by each algorithm in each dimension. d, The bootstrap distribution of rankings (N=1000) shows that the ranking scheme is stable with respect to sampling variability.
  • Figure 3: Dot and box plots of the Dice Similarity Coefficient (DSC) values of top-performing algorithms for the 13 organs on the internal validation set. The box plots display descriptive statistics across all internal validation cases, with the median value represented by the black horizontal line within the box, the lower and upper quartiles delineating the borders of the box, and the vertical black lines indicating 1.5 times the interquartile range. The algorithms are ranked on the x-axis based on their median DSC scores.
  • Figure 4: Performance on three external validation sets. a, The segmentation performance (Dice Similarity Coefficient, DSC) on the North American (NAM.) cohort, European (Eur.) cohort, and Asian cohort. For each cohort, the DSC scores with and without unlabeled data are presented as well. b, Visualized segmentation examples of the two top algorithms show that using unlabeled data can significantly improve the segmentation quality. c-e, The segmentation performance of the best-accuracy algorithm (aladdin5) and the best-performing algorithm (blackbean) across demographics, including genders, ages, and manufacturers. f, Pearson's correlation contour plots of the organ volume demonstrate that the two top algorithms accurately quantify the liver and spleen volumes, which are important clinical biomarkers.
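
The volume agreement described in Figure 4f can be illustrated with a small sketch: an organ volume is the number of voxels carrying that organ's label multiplied by the physical voxel size, and Pearson's r then compares predicted and reference volumes across cases. This is not the authors' analysis code. The helper names `organ_volume_ml` and `pearson_r`, the flat-list mask representation, and the toy volumes are all hypothetical.

```python
import math


def organ_volume_ml(mask, label, spacing_mm):
    """Volume in millilitres of all voxels equal to `label`.

    `mask` is a flat sequence of voxel labels and `spacing_mm` is the
    (x, y, z) voxel spacing in millimetres; 1 mL equals 1000 mm^3.
    """
    voxel_mm3 = spacing_mm[0] * spacing_mm[1] * spacing_mm[2]
    n_voxels = sum(1 for v in mask if v == label)
    return n_voxels * voxel_mm3 / 1000.0


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-case liver volumes (mL): reference vs. algorithm output.
reference_ml = [1450.0, 1620.0, 1380.0, 1710.0]
predicted_ml = [1440.0, 1600.0, 1400.0, 1690.0]
r = pearson_r(reference_ml, predicted_ml)
```

A value of r close to 1 indicates that the automatically measured volumes track the reference volumes closely. Correlation alone does not reveal a systematic offset, so it is usually read alongside the raw scatter.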