Table of Contents
Fetching ...

Benchmarking CNN-based Models against Transformer-based Models for Abdominal Multi-Organ Segmentation on the RATIC Dataset

Lukas Bayer, Sheethal Bhat, Andreas Maier

Abstract

Accurate multi-organ segmentation in abdominal CT scans is essential for computer-aided diagnosis and treatment. While convolutional neural networks (CNNs) have long been the standard approach in medical image segmentation, transformer-based architectures have recently gained attention due to their ability to model long-range dependencies. In this study, we systematically benchmark the three hybrid transformer-based models UNETR, SwinUNETR, and UNETR++ against a strong CNN baseline, SegResNet, for volumetric multi-organ segmentation on the heterogeneous RATIC dataset. The dataset comprises 206 annotated CT scans from 23 institutions worldwide, covering five abdominal organs. All models were trained and evaluated under identical preprocessing and training conditions using the Dice Similarity Coefficient (DSC) as the primary metric. The results show that the CNN-based SegResNet achieves the highest overall performance, outperforming all hybrid transformer-based models across all organs. Among the transformer-based approaches, UNETR++ delivers the most competitive results, while UNETR demonstrates notably faster convergence with fewer training iterations. These findings suggest that, for small- to medium-sized heterogeneous datasets, well-optimized CNN architectures remain highly competitive and may outperform hybrid transformer-based designs.

Benchmarking CNN-based Models against Transformer-based Models for Abdominal Multi-Organ Segmentation on the RATIC Dataset

Abstract

Accurate multi-organ segmentation in abdominal CT scans is essential for computer-aided diagnosis and treatment. While convolutional neural networks (CNNs) have long been the standard approach in medical image segmentation, transformer-based architectures have recently gained attention due to their ability to model long-range dependencies. In this study, we systematically benchmark the three hybrid transformer-based models UNETR, SwinUNETR, and UNETR++ against a strong CNN baseline, SegResNet, for volumetric multi-organ segmentation on the heterogeneous RATIC dataset. The dataset comprises 206 annotated CT scans from 23 institutions worldwide, covering five abdominal organs. All models were trained and evaluated under identical preprocessing and training conditions using the Dice Similarity Coefficient (DSC) as the primary metric. The results show that the CNN-based SegResNet achieves the highest overall performance, outperforming all hybrid transformer-based models across all organs. Among the transformer-based approaches, UNETR++ delivers the most competitive results, while UNETR demonstrates notably faster convergence with fewer training iterations. These findings suggest that, for small- to medium-sized heterogeneous datasets, well-optimized CNN architectures remain highly competitive and may outperform hybrid transformer-based designs.
Paper Structure (20 sections, 1 equation, 6 figures, 2 tables)

This paper contains 20 sections, 1 equation, 6 figures, 2 tables.

Figures (6)

  • Figure 1.1: Class wise distribution for train, test and validation dataset
  • Figure 1.2: Qualitative comparison of the performance of UNETR, SwinUNETR, UNETR++ and SegResNet on a test scan from the RATIC dataset. The figure shows the predicted segmentation masks for each model next to the original CT scan. The ground truth segmentation is shown for comparison.
  • Figure 1.3: Schematic visualization of the UNETR architecture as used in this benchmark. 0000-01
  • Figure 1.4: Schematic visualization of the SwinUNETR architecture as used in this benchmark. 0000-04
  • Figure 1.5: Schematic visualization of the UNETR++ architecture as used in this benchmark. 0000-06
  • ...and 1 more figures