STU-Net: Scalable and Transferable Medical Image Segmentation Models Empowered by Large-Scale Supervised Pre-training

Ziyan Huang, Haoyu Wang, Zhongying Deng, Jin Ye, Yanzhou Su, Hui Sun, Junjun He, Yun Gu, Lixu Gu, Shaoting Zhang, Yu Qiao

TL;DR

This work introduces STU-Net, a family of scalable U-Net models built on nnU-Net for medical image segmentation. It demonstrates that scaling depth and width together, along with architectural refinements and weight-free upsampling, enables models from 14M to 1.4B parameters. Pre-training on the large TotalSegmentator dataset substantially improves transferability, enabling strong direct inference and fine-tuning performance across 14 and 3 downstream datasets, respectively. The results highlight the value of large-scale pre-training for cross-domain medical segmentation and establish STU-Net-H as a robust universal model, marking progress toward foundation-model-style MedAI.

Abstract

Large-scale models pre-trained on large-scale datasets have profoundly advanced the development of deep learning. However, the state-of-the-art models for medical image segmentation are still small-scale, with their parameters only in the tens of millions. Scaling them up by further orders of magnitude is rarely explored. An overarching goal of exploring large-scale models is to train them on large-scale medical segmentation datasets for better transfer capacities. In this work, we design a series of Scalable and Transferable U-Net (STU-Net) models, with parameter sizes ranging from 14 million to 1.4 billion. Notably, the 1.4B STU-Net is the largest medical image segmentation model to date. Our STU-Net is based on the nnU-Net framework due to its popularity and impressive performance. We first refine the default convolutional blocks in nnU-Net to make them scalable. Then, we empirically evaluate different scaling combinations of network depth and width, discovering that it is optimal to scale model depth and width together. We train our scalable STU-Net models on the large-scale TotalSegmentator dataset and find that increasing model size yields larger performance gains. This observation suggests that large models are promising for medical image segmentation. Furthermore, we evaluate the transferability of our model on 14 downstream datasets for direct inference and 3 datasets for further fine-tuning, covering various modalities and segmentation targets. We observe good performance of our pre-trained model in both direct inference and fine-tuning. The code and pre-trained models are available at https://github.com/Ziyan-Huang/STU-Net.
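
To give a feel for how scaling depth and width changes model size, the following is a minimal sketch that counts convolution parameters in a toy 3D encoder. It is not the STU-Net architecture: the function names, the four-stage layout, the channel doubling per stage, and the two-convolution block are all assumptions made for illustration, and normalization layers and the decoder are ignored.

```python
def conv3d_params(c_in, c_out, k=3):
    """Weights plus biases of one k x k x k 3D convolution."""
    return c_in * c_out * k ** 3 + c_out


def toy_encoder_params(base_width, blocks_per_stage, num_stages=4):
    """Count conv parameters in a hypothetical encoder (illustration only).

    Width:  stage s uses base_width * 2**s channels (assumed doubling).
    Depth:  each stage stacks `blocks_per_stage` blocks of two 3x3x3 convs.
    """
    total = 0
    c_in = 1  # single-channel input volume (assumption)
    for stage in range(num_stages):
        c_out = base_width * 2 ** stage
        for _ in range(blocks_per_stage):
            total += conv3d_params(c_in, c_out)   # first conv of the block
            total += conv3d_params(c_out, c_out)  # second conv of the block
            c_in = c_out
    return total


base = toy_encoder_params(base_width=32, blocks_per_stage=1)
wider = toy_encoder_params(base_width=64, blocks_per_stage=1)
deeper = toy_encoder_params(base_width=32, blocks_per_stage=2)
both = toy_encoder_params(base_width=64, blocks_per_stage=2)

print(f"base   : {base:>12,d}")
print(f"wider  : {wider:>12,d}  ({wider / base:.2f}x)")
print(f"deeper : {deeper:>12,d}  ({deeper / base:.2f}x)")
print(f"both   : {both:>12,d}  ({both / base:.2f}x)")
```

In this toy setting, doubling the width roughly quadruples the parameter count, because each convolution scales with the product of its input and output channels, while doubling the depth grows it roughly linearly. Scaling both compounds the two effects, which is how a family of models can span two orders of magnitude in size.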

Paper Structure (26 sections, 5 figures, 9 tables)

Figures (5)

  • Figure 1: Segmentation performance of various models on the TotalSegmentator dataset. The area of each bubble is proportional to the FLOPs (floating-point operations) of the corresponding model at different scales. Distinct colors represent different models, while multiple bubbles of the same color denote the same model at varying scales. FLOPs calculations are based on input patch sizes of $128\times 128\times 128$.
  • Figure 2: Illustration of our STU-Net architecture, which is built upon the nnU-Net architecture with several modifications to enhance its scalability and transferability. (a) An overview of the STU-Net architecture. The blue arrows denote downsampling, while the yellow ones represent upsampling. (b) Residual blocks to achieve a large-scale model. (c) Downsampling in the first residual block of each encoder stage. (d-e) Stem and segmentation head for channel conversion of input and output. (f) Weight-free interpolation for upsampling, which effectively addresses the issue of weight mismatch across different tasks.
  • Figure 3: Qualitative visualization of our STU-Net with different scales and nnU-Net on various medical imaging datasets. Each row displays a representative case from a distinct dataset: Row 1 - FLARE22 dataset, Row 2 - AMOS dataset with CT images, Row 3 - AMOS dataset with MR images, Row 4 - AutoPET dataset with CT images, and Row 5 - AutoPET dataset with PET images. The columns, from left to right, correspond to the original image, the ground truth (gt), the nnU-Net results, and our STU-Net-B-ft, STU-Net-L-ft, and STU-Net-H-ft results.
  • Figure 4: Comparison of mean DSC ($\% \uparrow$) performance for STU-Net models with different scales, trained on subsets of the TotalSegmentator training set with different proportions of training cases, and evaluated on the same TotalSegmentator validation set.
  • Figure 5: Comparison between five specialized expert STU-Net models and a single universal STU-Net model on the TotalSegmentator dataset. Each expert model targets one of the five subcategories (i.e., organs, vertebrae, cardiac, muscles, and ribs), while the universal model is trained on all 104 classes. Performance is measured using the mean DSC for each of the five subcategories and for the TotalSegmentator dataset overall. STU-Net architectures (S, B, L, and H) are depicted for both expert and universal models. Lighter colors represent expert models, and darker colors indicate universal models.
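
Figures 4 and 5 report the mean DSC (Dice similarity coefficient). As a minimal sketch of that metric, the code below computes the per-class DSC, $2|P \cap G| / (|P| + |G|)$, on flattened label maps and averages it over classes. The function names are hypothetical, and skipping classes that are absent from both prediction and ground truth is an assumption; the evaluation protocol actually used for the figures is not described in this section.

```python
def dice(pred, gt, label):
    """DSC for one class label over two equally long flat label sequences."""
    p = [x == label for x in pred]
    g = [x == label for x in gt]
    intersection = sum(a and b for a, b in zip(p, g))
    denominator = sum(p) + sum(g)
    if denominator == 0:
        return None  # class absent in both; skipped below (an assumption)
    return 2 * intersection / denominator


def mean_dice(pred, gt, labels):
    """Average the per-class DSC over the classes that are present."""
    scores = [d for d in (dice(pred, gt, lb) for lb in labels) if d is not None]
    return sum(scores) / len(scores) if scores else float("nan")


# Toy example: six voxels, background 0 and two foreground classes.
pred = [0, 1, 1, 2, 2, 2]
gt = [0, 1, 2, 2, 2, 0]
print(f"DSC class 1: {dice(pred, gt, 1):.3f}")
print(f"DSC class 2: {dice(pred, gt, 2):.3f}")
print(f"mean DSC   : {mean_dice(pred, gt, [1, 2]):.3f}")
```

A per-subcategory score such as those in Figure 5 would restrict `labels` to the classes of that subcategory, while the overall score would pass all classes.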