nnU-Net Revisited: A Call for Rigorous Validation in 3D Medical Image Segmentation

Fabian Isensee, Tassilo Wald, Constantin Ulrich, Michael Baumgartner, Saikat Roy, Klaus Maier-Hein, Paul F. Jaeger

TL;DR

This paper interrogates the recent push toward novel architectures for 3D medical image segmentation by applying a rigorous, standardized validation protocol. It conducts a large-scale benchmark across CNN-, Transformer-, and Mamba-based methods within the nnU-Net framework, using fixed hardware budgets and a diverse, carefully selected dataset suite. The key finding is that CNN-based U-Net variants, configured and scaled appropriately, continue to outperform Transformer- and Mamba-based approaches; Auto3DSeg also underperforms relative to nnU-Net. The work calls for a cultural shift toward rigorous validation, proposes standardized baselines and dataset suitability criteria, and provides practical guidance to reduce validation bias in future 3D segmentation research.

Abstract

The release of nnU-Net marked a paradigm shift in 3D medical image segmentation, demonstrating that a properly configured U-Net architecture could still achieve state-of-the-art results. Despite this, the pursuit of novel architectures, and the respective claims of superior performance over the U-Net baseline, continued. In this study, we demonstrate that many of these recent claims fail to hold up when scrutinized for common validation shortcomings, such as the use of inadequate baselines, insufficient datasets, and neglected computational resources. By meticulously avoiding these pitfalls, we conduct a thorough and comprehensive benchmarking of current segmentation methods including CNN-based, Transformer-based, and Mamba-based approaches. In contrast to current beliefs, we find that the recipe for state-of-the-art performance is 1) employing CNN-based U-Net models, including ResNet and ConvNeXt variants, 2) using the nnU-Net framework, and 3) scaling models to modern hardware resources. These results indicate an ongoing innovation bias towards novel architectures in the field and underscore the need for more stringent validation standards in the quest for scientific progress.

Paper Structure

This paper contains 11 sections, 1 figure, and 6 tables.

Figures (1)

  • Figure 1: Benchmarking suitability of popular datasets measured as the ratio of inter- versus intra-method standard deviation (SD). The dashed line denotes a ratio of 1.
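
The suitability ratio named in the Figure 1 caption can be illustrated with a minimal sketch. The caption only states that the measure is the ratio of inter- to intra-method SD, so the exact definitions here are assumptions: intra-method SD is taken as the SD of one method's scores across folds, averaged over methods, and inter-method SD as the SD of the per-method mean scores. The function name `suitability_ratio`, the input layout, and the example scores are hypothetical and not taken from the paper.

```python
import statistics


def suitability_ratio(scores_by_method):
    """Return inter-method SD divided by intra-method SD.

    scores_by_method maps a method name to its per-fold scores (e.g. Dice).
    Assumed definitions (not confirmed by the source):
      inter-method SD = sample SD of the per-method mean scores
      intra-method SD = mean over methods of the sample SD across folds
    """
    if len(scores_by_method) < 2:
        raise ValueError("need at least two methods")
    fold_scores = list(scores_by_method.values())
    if any(len(s) < 2 for s in fold_scores):
        raise ValueError("need at least two folds per method")

    method_means = [statistics.mean(s) for s in fold_scores]
    inter_sd = statistics.stdev(method_means)
    intra_sd = statistics.mean([statistics.stdev(s) for s in fold_scores])
    if intra_sd == 0:
        raise ValueError("intra-method SD is zero; ratio is undefined")
    return inter_sd / intra_sd


# Hypothetical scores: methods differ by more than their fold-to-fold noise.
separable = {
    "method_a": [0.80, 0.82, 0.84],
    "method_b": [0.90, 0.92, 0.94],
}
# Hypothetical scores: fold-to-fold noise swamps the difference between methods.
noisy = {
    "method_a": [0.70, 0.90, 0.80],
    "method_b": [0.75, 0.95, 0.73],
}

print(suitability_ratio(separable) > 1)  # above the dashed line in Figure 1
print(suitability_ratio(noisy) > 1)      # below the dashed line
```

Under these assumptions, a ratio above 1 means the differences between methods exceed the fold-to-fold noise within a method, which is what makes a dataset useful for comparing methods.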