The ULS23 Challenge: a Baseline Model and Benchmark Dataset for 3D Universal Lesion Segmentation in Computed Tomography

M. J. J. de Grauw; E. Th. Scholten; E. J. Smit; M. J. C. M. Rutten; M. Prokop; B. van Ginneken; A. Hering

TL;DR

The ULS23 benchmark for 3D universal lesion segmentation in chest-abdomen-pelvis CT examinations is introduced and a baseline semi-supervised 3D lesion segmentation model is developed and publicly released.

Abstract

Size measurements of tumor manifestations on follow-up CT examinations are crucial for evaluating treatment outcomes in cancer patients. Efficient lesion segmentation can speed up these radiological workflows. While numerous benchmarks and challenges address lesion segmentation in specific organs like the liver, kidneys, and lungs, the larger variety of lesion types encountered in clinical practice demands a more universal approach. To address this gap, we introduce the ULS23 benchmark for 3D universal lesion segmentation in chest-abdomen-pelvis CT examinations. The ULS23 training dataset contains 38,693 lesions across this region, including challenging pancreatic, colon and bone lesions. For evaluation purposes, we curated a dataset comprising 775 lesions from 284 patients. Each of these lesions was identified as a target lesion in a clinical context, ensuring diversity and clinical relevance within this dataset. The ULS23 benchmark is publicly accessible via uls23.grand-challenge.org, enabling researchers worldwide to assess the performance of their segmentation methods. Furthermore, we have developed and publicly released our baseline semi-supervised 3D lesion segmentation model. This model achieved an average Dice coefficient of 0.703 $\pm$ 0.240 on the challenge test set. We invite ongoing submissions to advance the development of future ULS models.
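As a reading aid for the reported metric, the following is a minimal sketch of how a Dice coefficient and its per-lesion average could be computed. It is not the challenge's evaluation code. The function name `dice`, the flat 0/1 mask representation, the handling of two empty masks, and the reading of the $\pm$ term as a population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def dice(pred, ref):
    """Dice coefficient between two flattened binary masks (sequences of 0/1).

    Hypothetical helper for illustration; a 3D mask would be flattened first.
    """
    pred = [bool(v) for v in pred]
    ref = [bool(v) for v in ref]
    if len(pred) != len(ref):
        raise ValueError("masks must contain the same number of voxels")
    # Voxels labeled as lesion in both the prediction and the reference.
    intersection = sum(p and r for p, r in zip(pred, ref))
    total = sum(pred) + sum(ref)
    # Assumption: two empty masks are treated as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy per-lesion masks (predicted, reference); values are made up.
cases = [
    ([1, 1, 1, 0, 0, 0], [0, 1, 1, 1, 0, 0]),  # partial overlap
    ([1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0]),  # exact match
]
scores = [dice(p, r) for p, r in cases]

# Assumption: the reported spread is a standard deviation over lesions.
print(f"Dice: {mean(scores):.3f} +/- {pstdev(scores):.3f}")
```

Averaging per lesion, rather than pooling all voxels, keeps small lesions from being outweighed by large ones; whether the benchmark aggregates this way is not stated in this excerpt.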

Paper Structure (30 sections, 2 equations, 12 figures, 5 tables)

Introduction
Related Work
Related Biomedical Grand Challenges
The nnUnet Framework
DeepLesion
Universal Lesion Segmentation
Materials
The ULS23 Dataset
Training - Fully-Annotated Datasets
Training - Partially-Annotated Datasets
Test Dataset
Validation Dataset
Challenge Design
Participation Requirements
Timeline & Results
...and 15 more sections

Figures (12)

Figure 1: Histograms depicting the long- and short-axis measurements in millimeters for various lesion types in the fully-annotated training data reveal notable trends. Kidney and colon lesions tend to be larger on average. Lymph nodes, pancreas, and colon lesions exhibit a greater disparity between their long- and short-axis sizes, indicating that these lesions are more often non-spherical.
Figure 2: Examples of GrabCut pseudo-masks. From left to right: a kidney lesion, mediastinal lymph node, subcutaneous mass, and lung lesion. Note how GrabCut tends to oversegment (orange mask $\blacksquare$) into healthy tissues compared to the reference measurements (purple lines $\blacksquare$). Lung lesions are visualized using Window Level: -500 HU, Window Width: 1400 HU. Lesions outside the lungs are visualized with WL: 350, WW: 40.
Figure 3: Training pipeline for the semi-supervised baseline model. a) In the first training iteration, an nnUnet is pretrained using the 2D GrabCut masks generated from the partially-annotated data and then fine-tuned on the fully-annotated data. b) In the second training iteration, a different nnUnet is pretrained using the predicted 3D pseudo-masks for the partially-annotated data and then fine-tuned on the fully-annotated data.
Figure 4: Boxplots of the long- and short-axis measurement errors for the baseline model on the different lesion types in the held-out training data. SAPE = Symmetric Absolute Percentage Error.
Figure 5: Boxplots of the long- and short-axis measurement errors for the baseline model on the test set. The fully-supervised types are lung, liver, kidney, colon, pancreas, and bone lesions and lymph nodes. Partially-supervised lesion types are those included in the partially-annotated data, e.g. adrenal, ovary, subcutaneous. SAPE = Symmetric Absolute Percentage Error.
...and 7 more figures
