How Well Do Supervised 3D Models Transfer to Medical Imaging Tasks?

Wenxuan Li; Alan Yuille; Zongwei Zhou

TL;DR

This study addresses the lack of large-scale annotated 3D data for transfer learning in medical imaging by creating AbdomenAtlas 1.1, a corpus of 9,262 CT volumes with per-voxel labels for 25 anatomical structures and pseudo labels for seven tumor types. It proposes SuPreM, a suite of supervised pre-trained models spanning CNN and Transformer backbones and trained on AbdomenAtlas 1.1, and demonstrates better transfer to 3D segmentation tasks than self-supervised baselines. Supervised pre-training proves far more data- and compute-efficient: a model pre-trained on only 21 volumes (672 masks, 40 GPU hours) transfers about as well as a self-supervised model pre-trained on 5,050 unlabeled volumes with 1,152 GPU hours. SuPreM also generalizes well to external datasets and novel classes while requiring fewer annotations. The work highlights practical implications for rapid, robust 3D medical image analysis and invites broader data-sharing and model-release efforts to advance clinical AI tools.

Abstract

The pre-training and fine-tuning paradigm has become prominent in transfer learning. For example, a model pre-trained on ImageNet and then fine-tuned on PASCAL can significantly outperform one trained on PASCAL from scratch. While ImageNet pre-training has been enormously successful, ImageNet is a 2D dataset and the features learned from it target classification; when transferred to more diverse tasks, such as 3D image segmentation, performance is inevitably compromised by the deviation from the original ImageNet context. A significant challenge is the lack of large, annotated 3D datasets rivaling the scale of ImageNet for model pre-training. To overcome this challenge, we make two contributions. First, we construct AbdomenAtlas 1.1, which comprises 9,262 three-dimensional computed tomography (CT) volumes with high-quality, per-voxel annotations of 25 anatomical structures and pseudo annotations of seven tumor types. Second, we develop a suite of models pre-trained on AbdomenAtlas 1.1 for transfer learning. Our preliminary analyses indicate that a model trained with only 21 CT volumes, 672 masks, and 40 GPU hours has a transfer learning ability similar to that of a model trained with 5,050 (unlabeled) CT volumes and 1,152 GPU hours. More importantly, the transfer learning ability of supervised models scales further with larger annotated datasets, achieving significantly better performance than preexisting pre-trained models, irrespective of their pre-training methodologies or data sources. We hope this study can facilitate collective efforts in constructing larger 3D medical datasets and encourage more releases of supervised pre-trained models.
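
The efficiency comparison in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `relative_reduction` is hypothetical, and the mask count assumes one mask per class per volume (25 structures plus seven tumor types), which is consistent with the quoted 672 masks for 21 volumes.

```python
def relative_reduction(small: float, large: float) -> float:
    """Return the percentage by which `small` is lower than `large`."""
    return 100.0 * (1.0 - small / large)


# Numbers quoted in the abstract.
supervised_volumes, supervised_gpu_hours = 21, 40
self_supervised_volumes, self_supervised_gpu_hours = 5050, 1152

# Assumption: every volume carries one mask per class (25 structures + 7 tumor types).
classes_per_volume = 25 + 7
masks = supervised_volumes * classes_per_volume

data_saving = relative_reduction(supervised_volumes, self_supervised_volumes)
compute_saving = relative_reduction(supervised_gpu_hours, self_supervised_gpu_hours)

print(f"masks: {masks}")
print(f"data reduction: {data_saving:.1f}%")
print(f"compute reduction: {compute_saving:.1f}%")
```

Under these assumptions the script reproduces the 672 masks and gives reductions that round to the 99.6% (data) and 96.5% (computation) stated in the Figure 2 caption.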

Paper Structure (24 sections, 12 figures, 6 tables)

Introduction
Brief History: Supervised Pre-Training
Material & Method
Extensive Dataset: AbdomenAtlas 1.1
A Suite of Pre-trained Models: SuPreM
Experiment & Analysis
Data, Annotation, and Computational Efficiency
Enhanced Features for Novel Datasets, Classes, and Tasks
Conclusion and Discussion
Appendix
An Extensive Dataset: AbdomenAtlas 1.1
Domain Transfer Across Datasets
Uniform Annotation Standards
A Suite of Pre-trained Models: SuPreM
Supervised and Self-supervised Benchmarking
...and 9 more sections

Figures (12)

Figure 1: Transfer performance on a proprietary dataset with few-shot examples ($N = 5, 10, 20$). Transfer performance (Y-axis) is the average DSC score across 20-class organ segmentation and 3-class tumor segmentation. In the few-shot setting, supervised pre-trained models (in red) generally transfer better than self-supervised pre-trained models (in gray). Notably, SuPreM achieves the best transfer performance among well-known publicly available models.
Figure 2: Analysis of pre-training and fine-tuning efficiency. For a fair comparison, both supervised (in red) and self-supervised (in gray) models use Swin UNETR as the backbone, and the self-supervised pre-training compared against is the current state of the art [tang2022self]. The target task is TotalSegmentator v1. (a) shows how transfer learning ability scales with the number of pre-training images: it improves consistently as more images are used. The model trained with 21 CT volumes, 672 masks, and 40 GPU hours shows a transfer learning ability similar to that of the model trained with 5,050 CT volumes and 1,152 GPU hours. In other words, supervised pre-training is more efficient, requiring 99.6% less data and 96.5% less computation. (b) assesses annotation and learning efficiency by fine-tuning models on different numbers of annotated CT volumes from TotalSegmentator. SuPreM, fine-tuned on 512 per-voxel annotated CT volumes, achieves segmentation performance on par with self-supervised models fine-tuned on 1,024 volumes, reducing the manual annotation cost for target tasks by 50%.
Figure 3: Fine-tuning SuPreM on fine-grained tumor classification. We plot receiver operating characteristic (ROC) curves to evaluate transfer learning performance on tumor classification. Detecting Cysts and PanNETs poses additional challenges for AI because these lesions exhibit a greater variety of texture patterns than PDACs. This diversity is reflected in the Area Under the Curve (AUC) values we obtained. For all three sub-types of pancreatic tumors, SuPreM (in red) outperforms the self-supervised model [tang2022self] (in gray), showcasing its effectiveness in fine-grained tumor classification.
Figure 4: Evolution from a combination of public data to AbdomenAtlas 1.1. AbdomenAtlas 1.1 is NOT a simple combination of existing datasets. The 9,262 CT volumes in the combined public datasets contain only 39K annotated organ masks in total, whereas AbdomenAtlas 1.1 provides over 251K annotated organ/tumor masks for these CT volumes, a 6.4-fold increase in the number of masks. Creating 251,323 high-quality organ/tumor masks for 9,262 CT volumes requires extensive medical knowledge and substantial annotation cost (it is much more difficult than annotating natural images). Based on our experience and the rates reported in [park2020annotated], trained radiologists annotate abdominal organs at a rate of 30--60 minutes per organ per CT volume. This translates to 247K human hours for completing AbdomenAtlas 1.1. To overcome this challenge, we employed a highly efficient annotation method that combines AI with the expertise of ten radiologists using active learning (details in the appendix on annotation standards), producing the largest annotated dataset to date.
Figure 5: Domain gaps. Examples of CT volumes from different domains (e.g., hospitals and countries) illustrate the variability in images. The CT volumes in AbdomenAtlas 1.1 were acquired with a wide variety of CT scanners and imaging protocols at numerous hospitals worldwide (see the dataset overview table). Substantial differences among CT volumes appear in image quality and technical display, originating from different acquisition parameters, reconstruction kernels, and contrast enhancements.
...and 7 more figures
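
The captions above report transfer performance as an average DSC (Dice similarity coefficient) across classes. As a reminder of what that metric measures, here is a minimal pure-Python sketch; it is not the authors' evaluation code. The names `dice_score` and `mean_dice`, the flattened label-list representation, and the convention of returning 1.0 when a class is absent from both prediction and ground truth are assumptions made for illustration.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int], label: int) -> float:
    """Dice similarity coefficient for one class over flattened voxel labels."""
    if len(pred) != len(target):
        raise ValueError("prediction and target must contain the same number of voxels")
    pred_count = sum(1 for p in pred if p == label)
    target_count = sum(1 for t in target if t == label)
    overlap = sum(1 for p, t in zip(pred, target) if p == label and t == label)
    if pred_count + target_count == 0:
        return 1.0  # assumed convention: class absent from both volumes
    return 2.0 * overlap / (pred_count + target_count)


def mean_dice(pred: Sequence[int], target: Sequence[int], labels: Sequence[int]) -> float:
    """Average the per-class Dice scores over the given class labels."""
    return sum(dice_score(pred, target, c) for c in labels) / len(labels)


# Toy six-voxel "volume": 0 = background, 1 and 2 = two hypothetical structures.
target = [0, 1, 1, 2, 2, 2]
pred = [0, 1, 2, 2, 2, 0]

per_class = {c: dice_score(pred, target, c) for c in (1, 2)}
average = mean_dice(pred, target, labels=(1, 2))
print(per_class, average)
```

Real pipelines compute the same quantity on 3D arrays, typically with a tensor library, and may treat empty classes differently; the flat lists here only keep the example dependency-free.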
