MedNeXt-v2: Scaling 3D ConvNeXts for Large-Scale Supervised Representation Learning in Medical Image Segmentation

Saikat Roy, Yannick Kirchhoff, Constantin Ulrich, Maximillian Rokuss, Tassilo Wald, Fabian Isensee, Klaus Maier-Hein

TL;DR

This work shifts the focus of large-scale supervised pretraining for 3D medical image segmentation from dataset size to whether the backbone is an effective representation learner. It benchmarks backbones before large-scale pretraining and introduces MedNeXt-v2, a compound-scaled 3D ConvNeXt with a 3D Global Response Normalization (GRN) module. Pretraining on 18k CT volumes and then fine-tuning with increased spatial context yields state-of-the-art results across six CT/MR benchmarks covering 144 structures, outperforming seven publicly released pretrained models. Key insights are that stronger from-scratch backbone performance predicts stronger performance after pretraining, that representation scaling disproportionately benefits tumor (pathological) segmentation, and that modality-specific pretraining offers negligible benefit once full fine-tuning is applied. The study provides practical guidance and open-source resources for scalable, supervised 3D medical representation learning.

Abstract

Large-scale supervised pretraining is rapidly reshaping 3D medical image segmentation. However, existing efforts focus primarily on increasing dataset size and overlook the question of whether the backbone network is an effective representation learner at scale. In this work, we address this gap by revisiting ConvNeXt-based architectures for volumetric segmentation and introducing MedNeXt-v2, a compound-scaled 3D ConvNeXt that leverages improved micro-architecture and data scaling to deliver state-of-the-art performance. First, we show that routinely used backbones in large-scale pretraining pipelines are often suboptimal. Subsequently, we use comprehensive backbone benchmarking prior to scaling and demonstrate that stronger from-scratch performance reliably predicts stronger downstream performance after pretraining. Guided by these findings, we incorporate a 3D Global Response Normalization module and use depth, width, and context scaling to improve our architecture for effective representation learning. We pretrain MedNeXt-v2 on 18k CT volumes and demonstrate state-of-the-art performance when fine-tuning across six challenging CT and MR benchmarks (144 structures), showing consistent gains over seven publicly released pretrained models. Beyond these improvements, our benchmarking of these models also reveals that stronger backbones yield better results on similar data, that representation scaling disproportionately benefits pathological segmentation, and that modality-specific pretraining offers negligible benefit once full fine-tuning is applied. In conclusion, our results establish MedNeXt-v2 as a strong backbone for large-scale supervised representation learning in 3D medical image segmentation. Our code and pretrained models are made available with the official nnUNet repository at: https://www.github.com/MIC-DKFZ/nnUNet
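
As a rough illustration of what a 3D Global Response Normalization module computes, the sketch below extends the GRN formulation published with ConvNeXt-v2 to three spatial axes using NumPy. The function name `grn_3d`, the channels-first `(C, D, H, W)` layout, and the `eps` value are assumptions made for this example; the MedNeXt-v2 implementation released with nnUNet may differ in these details.

```python
import numpy as np


def grn_3d(x, gamma, beta, eps=1e-6):
    """Hypothetical 3D GRN sketch for one sample.

    x:     feature map of shape (C, D, H, W)
    gamma: learnable per-channel scale of shape (C,)
    beta:  learnable per-channel shift of shape (C,)
    """
    # 1) Global aggregation: L2 norm of each channel over all voxels.
    gx = np.sqrt((x ** 2).sum(axis=(1, 2, 3), keepdims=True))  # (C, 1, 1, 1)
    # 2) Divisive normalization: each channel's norm relative to the channel mean.
    nx = gx / (gx.mean(axis=0, keepdims=True) + eps)
    # 3) Calibration with a residual connection back to the input.
    g = gamma.reshape(-1, 1, 1, 1)
    b = beta.reshape(-1, 1, 1, 1)
    return g * (x * nx) + b + x


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8, 8))

# With gamma = beta = 0 the module reduces to the identity mapping.
identity = grn_3d(x, np.zeros(4), np.zeros(4))
# With gamma = 1 each channel is rescaled by (1 + its relative norm).
scaled = grn_3d(x, np.ones(4), np.zeros(4))
```

In the ConvNeXt-v2 formulation, `gamma` and `beta` start at zero, so the layer begins as an identity mapping and the channel-wise rescaling is learned during training.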

Paper Structure

This paper contains 33 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: MedNeXt-v2 sets a new state-of-the-art in 3D medical image segmentation. By leveraging micro-architectural improvements and large-scale pretraining, it outperforms powerful existing networks across multiple 3D medical segmentation tasks.
  • Figure 2: Network improvements. Our network scaling targets the base number of channels (C), while 3D GRN improves the micro-architecture by limiting activation saturation or collapse during training. Also shown is our context scaling strategy.
  • Figure 3: Channel activation visualization demonstrates that 3D GRN reduces redundant activations. Akin to ConvNeXt-v2 (Woo et al., 2023), our visualization of 64 activations in layer 1 of MedNeXt-v2 (with GRN) and MedNeXt-v1 (without GRN) on Pediatric-CT (D1), Stanford Knee (D2) and Pancreatic Tumor (D5) from the main results table shows that 3D GRN prevents dead or saturated activations in 3D medical image segmentation tasks, which avoids feature collapse and aids representation learning.
  • Figure 4: MedNeXt-v2 outperforms MedNeXt-v1 from scratch and during fine-tuning. The addition of GRN stabilizes the performance of the v2 architecture and improves performance compared to MedNeXt-v1, as seen in the main results table. We see improvements of more than 1.0 Dice points on tasks as diverse as the segmentation of Pediatric Organs in CTs (D1) and Pancreatic Tumor in MR (D5). We observe only limited gains on the highly saturated knee segmentation task (D2) for all methods.
  • Figure 5: Increased context during fine-tuning improves performance. Increasing the available spatial context 3.375-fold with $192^3$ patches is a cheap and effective strategy to leverage our pretrained MedNeXt-v2 while limiting pretraining costs. Importantly, we see an example from Toothfairy (D3) where the added spatial context of the jaw enables better segmentation of the teeth near the image boundary, which a MedNeXt-v2 fine-tuned on $128^3$ patches is unable to segment accurately.
  • ...and 3 more figures
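
The 3.375-fold context figure quoted in the Figure 5 caption follows directly from the two patch sizes it names. A minimal check, assuming cubic patches and taking $128^3$ as the baseline (the helper name `context_ratio` is hypothetical):

```python
def context_ratio(large_edge: int, baseline_edge: int) -> float:
    """Ratio of voxels per patch between two cubic patch sizes."""
    return (large_edge / baseline_edge) ** 3


baseline_voxels = 128 ** 3   # 2,097,152 voxels per patch
large_voxels = 192 ** 3      # 7,077,888 voxels per patch
ratio = context_ratio(192, 128)

print(ratio)  # 3.375
```

Each edge grows by a factor of 1.5, so the volume seen per patch grows by 1.5 cubed.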