WORD: A large scale dataset, benchmark and clinical applicable study for abdominal organ segmentation from CT image

Xiangde Luo, Wenjun Liao, Jianghong Xiao, Jieneng Chen, Tao Song, Xiaofan Zhang, Kang Li, Dimitris N. Metaxas, Guotai Wang, Shaoting Zhang

TL;DR

The paper introduces WORD, a large-scale, fully annotated whole-abdominal CT dataset with 16 organs, designed to benchmark segmentation methods and support clinical translation. It provides thorough evaluations of state-of-the-art models, clinician-model gaps, cross-dataset generalization, and strategies for inference- and annotation-efficient segmentation, including scribble-based weak supervision. Key contributions include establishing a high-quality benchmark, analyzing domain shifts across public datasets, and proposing regularized scribble-learning approaches that substantially reduce labeling cost while maintaining competitive accuracy. The work underscores ongoing challenges in small-organ segmentation and sets a foundation for robust, clinically applicable abdominal multi-organ segmentation research.

Abstract

Whole abdominal organ segmentation is important in diagnosing abdominal lesions, radiotherapy, and follow-up. However, manual delineation of all abdominal organs from 3D volumes by oncologists is time-consuming and very expensive. Deep learning-based medical image segmentation has shown the potential to reduce manual delineation effort, but it still requires a large-scale, finely annotated dataset for training, and large-scale datasets that cover the whole abdominal region with accurate and detailed annotations are lacking. In this work, we establish a new large-scale Whole abdominal ORgan Dataset (WORD) for algorithm research and clinical application development. The dataset contains 150 abdominal CT volumes (30,495 slices). Each volume has 16 organs with fine pixel-level annotations and scribble-based sparse annotations, which may make it the largest dataset with whole abdominal organ annotation. Several state-of-the-art segmentation methods are evaluated on this dataset. We also invited three experienced oncologists to revise the model predictions to measure the gap between the deep learning method and oncologists. We then investigate inference-efficient learning on WORD, as high-resolution images require large GPU memory and a long inference time at test time. We further evaluate scribble-based annotation-efficient learning on this dataset, as pixel-wise manual annotation is time-consuming and expensive. This work provides a new benchmark for the abdominal multi-organ segmentation task, and these experiments can serve as baselines for future research and clinical application development.

Paper Structure

This paper contains 29 sections, 5 equations, 6 figures, 15 tables.

Figures (6)

  • Figure 1: An example of 16 annotated abdominal organs in a CT scan. The table on the left lists the annotated organ categories. (a), (b), and (c) show the axial, coronal, and sagittal views, respectively. (d) shows the 3D rendering of the annotated abdominal organs.
  • Figure 2: Volume distribution of 16 organs in WORD.
  • Figure 3: User study conducted independently by three junior oncologists, each from a different hospital.
  • Figure 4: Visual comparison of segmentation performance on four different datasets. All predictions were produced by the nnUNetV2 (3D) pre-trained on the WORD.
  • Figure 5: Comparison of the intensity distributions of LiTS, BTCV, TCIA, and WORD. HU means Hounsfield Unit.
  • ...and 1 more figure