A-Eval: A Benchmark for Cross-Dataset Evaluation of Abdominal Multi-Organ Segmentation

Ziyan Huang; Zhongying Deng; Jin Ye; Haoyu Wang; Yanzhou Su; Tianbin Li; Hui Sun; Junlong Cheng; Jianpin Chen; Junjun He; Yun Gu; Shaoting Zhang; Lixu Gu; Yu Qiao

TL;DR

The paper addresses cross-dataset generalization in abdominal multi-organ segmentation by introducing A-Eval, a benchmark that trains on FLARE22, AMOS, WORD, and TotalSegmentator and evaluates on the validation sets of those four datasets plus the BTCV training set, for five evaluation datasets in total. It adopts the STU-Net architecture (an nnU-Net derivative) in multiple sizes and compares data usage strategies, including pseudo-labeling of unlabeled data, multi-modality training, and joint training across datasets, to quantify their impact on generalization. Key findings show that larger and more diverse training data, pseudo-labeled unlabeled data, and joint training substantially improve cross-dataset performance, while very large models need commensurate amounts of data to realize gains. The study offers practical guidance for assembling large-scale abdominal datasets and designing training protocols that generalize to real-world clinical settings. Code and pretrained models are released in the project repository.

Abstract

Although deep learning has revolutionized abdominal multi-organ segmentation, models often struggle with generalization due to training on small, specific datasets. With the recent emergence of large-scale datasets, some important questions arise: Can models trained on these datasets generalize well to different ones? Whether or not they can, how can their generalizability be further improved? To address these questions, we introduce A-Eval, a benchmark for the cross-dataset Evaluation ('Eval') of Abdominal ('A') multi-organ segmentation. We employ training sets from four large-scale public datasets: FLARE22, AMOS, WORD, and TotalSegmentator, each providing extensive labels for abdominal multi-organ segmentation. For evaluation, we incorporate the validation sets from these datasets along with the training set from the BTCV dataset, forming a robust benchmark comprising five distinct datasets. We evaluate the generalizability of various models using the A-Eval benchmark, with a focus on diverse data usage scenarios: training on individual datasets independently, utilizing unlabeled data via pseudo-labeling, mixing different modalities, and joint training across all available datasets. Additionally, we explore the impact of model sizes on cross-dataset generalizability. Through these analyses, we underline the importance of effective data usage in enhancing models' generalization capabilities, offering valuable insights for assembling large-scale datasets and improving training strategies. The code and pre-trained models are available at https://github.com/uni-medical/A-Eval.
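
As a rough illustration of the protocol described in the abstract, the sketch below enumerates every pairing of a training dataset with an evaluation dataset. The dataset names come from the abstract; the function name `evaluation_pairs` and the `same_source` flag are hypothetical and are not taken from the A-Eval repository.

```python
from itertools import product

# Training sets named in the abstract.
TRAIN_SETS = ["FLARE22", "AMOS", "WORD", "TotalSegmentator"]
# Evaluation sets: validation splits of the four above plus the BTCV training set.
EVAL_SETS = ["FLARE22", "AMOS", "WORD", "TotalSegmentator", "BTCV"]


def evaluation_pairs(train_sets, eval_sets):
    """Return one record per (training set, evaluation set) combination.

    `same_source` marks pairs where the model is evaluated on the validation
    split of the dataset it was trained on (hypothetical helper field).
    """
    return [
        {"train": t, "eval": e, "same_source": t == e}
        for t, e in product(train_sets, eval_sets)
    ]


pairs = evaluation_pairs(TRAIN_SETS, EVAL_SETS)
print(len(pairs))  # number of train/eval combinations
```

BTCV contributes no training set in this setup, so every model is evaluated on it as unseen data.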

Paper Structure (19 sections, 2 equations, 3 figures, 13 tables)

This paper contains 19 sections, 2 equations, 3 figures, 13 tables.

Introduction
Related Work
Abdominal Multi-Organ Segmentation Benchmarks
Model Generalizability
A-Eval Benchmark
Datasets for A-Eval
Cross-Dataset Protocols
Model Architecture and Training Procedure
Evaluation Metrics and Inference Procedure
Experiments and Results
Implementation Details
Cross-Dataset Evaluation for Models Trained on Individual Datasets
Impact of Pseudo-Labeling on Model Generalizability
Impact of Multi-Modality Data on Model Generalizability
Improving Generalizability Through Joint Training Across Multiple Datasets
...and 4 more sections

Figures (3)

Figure 1: Comparison of the original evaluation approach versus our proposed A-Eval benchmark for assessing model generalizability. (a) Original evaluation involves training and testing on the same dataset, providing good results but leaving uncertainty when applied to other datasets. (b) A-Eval, on the other hand, trains and tests across different datasets, offering a more comprehensive evaluation of model performance and its potential for generalizability.
Figure 2: Visualization of the performance of STU-Net-L models trained individually on different datasets (FLARE22, AMOS CT, AMOS MR, WORD, TotalSegmentator) and validated on multiple datasets (FLARE22, AMOS CT, AMOS MR, WORD, TotalSegmentator, BTCV) within the A-Eval Benchmark. Each row corresponds to testing on a different dataset, while the columns show the original image, the ground truth, and the segmentation results obtained from models trained individually on different datasets.
Figure 3: Comparison of generalizability for STU-Net models of different sizes. Blue and red bars denote mean DSC and NSD values, respectively. Means are calculated from 20 cross-dataset evaluations (trained on four datasets and tested on five). Error bars represent the standard deviation, indicating how much each model's performance varies across these evaluations.
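
The aggregation described in the Figure 3 caption can be sketched as follows, under stated assumptions. `dice_score` implements the standard Dice similarity coefficient (DSC) on flattened binary masks, and `summarize` reduces a 4 x 5 matrix of per-evaluation scores to a mean and a standard deviation. Both function names are hypothetical, the use of the population standard deviation is an assumption (the caption does not say which variant is used), and NSD is omitted because it needs surface distances.

```python
from statistics import fmean, pstdev


def dice_score(pred, target):
    """Dice similarity coefficient for two flattened binary masks."""
    pred = [bool(v) for v in pred]
    target = [bool(v) for v in target]
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p and t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if denom == 0 else 2.0 * intersection / denom


def summarize(score_matrix):
    """Mean and population standard deviation over all train/test scores.

    `score_matrix` is a list of rows, one row per training dataset and one
    column per evaluation dataset (4 x 5 in the setup of Figure 3).
    """
    flat = [score for row in score_matrix for score in row]
    return fmean(flat), pstdev(flat)


# Toy masks for illustration only, not data from the paper.
print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
```

A smaller standard deviation across the 20 scores means performance is more uniform across training and evaluation pairings.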
