FreeTumor: Large-Scale Generative Tumor Synthesis in Computed Tomography Images for Improving Tumor Recognition

Linshan Wu, Jiaxin Zhuang, Yanning Zhou, Sunan He, Jiabo Ma, Luyang Luo, Xi Wang, Xuefeng Ni, Xiaoling Zhong, Mingxiang Wu, Yinghua Zhao, Xiaohui Duan, Varut Vardhanabhuti, Pranav Rajpurkar, Hao Chen

TL;DR

FreeTumor tackles data scarcity in CT tumor recognition by leveraging large-scale unlabeled CT volumes to synthesize diverse, high-fidelity tumors on healthy organs. The method employs a two-stage adversarial framework guided by a segmentation-based discriminator, with online synthesis on unlabeled data and a quality-control mechanism to filter poor samples. In a Visual Turing Test, 13 board-certified radiologists distinguished synthetic from real tumors with only 51.1% average sensitivity and 60.8% average accuracy, indicating high realism. Training with the synthetic tumors yields consistent improvements in segmentation (average Dice gain of 6.7%) and early tumor detection (average sensitivity gain of 16.4%) across 12 public datasets. Compared with state-of-the-art synthesis methods and CT foundation models, FreeTumor generalizes better, including in out-of-domain evaluation, suggesting strong potential for clinical adoption and improved patient outcomes.

Abstract

Tumors are a leading cause of death worldwide, with an estimated 10 million deaths attributed to tumor-related diseases every year. AI-driven tumor recognition unlocks new possibilities for more precise and intelligent tumor screening and diagnosis. However, progress is heavily hampered by the scarcity of annotated datasets, whose creation demands extensive annotation effort from radiologists. To tackle this challenge, we introduce FreeTumor, an innovative Generative AI (GAI) framework that enables large-scale tumor synthesis to mitigate data scarcity. Specifically, FreeTumor leverages a combination of limited labeled data and large-scale unlabeled data for tumor synthesis training. Unleashing the power of large-scale data, FreeTumor is capable of synthesizing a large number of realistic tumors on images to augment training datasets. To this end, we create the largest training dataset for tumor synthesis and recognition by curating 161,310 publicly available Computed Tomography (CT) volumes from 33 sources, with only 2.3% containing annotated tumors. To validate the fidelity of the synthetic tumors, we engaged 13 board-certified radiologists in a Visual Turing Test to discern between synthetic and real tumors. Rigorous clinician evaluation validates the high quality of our synthetic tumors: the radiologists achieved only 51.1% sensitivity and 60.8% accuracy in distinguishing our synthetic tumors from real ones. Through high-quality tumor synthesis, FreeTumor scales up the recognition training datasets by over 40 times and shows a notable superiority over state-of-the-art AI methods, including various synthesis methods and foundation models. These findings indicate promising prospects for FreeTumor in clinical applications, potentially advancing tumor treatments and improving the survival rates of patients.
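
To make the Visual Turing Test numbers concrete, the following is a minimal sketch of how sensitivity, specificity, and accuracy could be computed from reader judgments. It assumes that "synthetic" is treated as the positive class, so sensitivity is the fraction of synthetic tumors that a reader correctly flags as synthetic; the abstract does not spell out this convention. The function name `turing_test_metrics` and the toy data are hypothetical and are not taken from the paper.

```python
def turing_test_metrics(is_synthetic, judged_synthetic):
    """Score one reader's judgments in a real-vs-synthetic test.

    Assumption (not stated in the abstract): "synthetic" is the positive class.
    is_synthetic[i]     -- True if case i really is a synthetic tumor
    judged_synthetic[i] -- True if the reader called case i synthetic
    """
    if len(is_synthetic) != len(judged_synthetic):
        raise ValueError("both lists must have the same length")

    pairs = list(zip(is_synthetic, judged_synthetic))
    tp = sum(1 for truth, vote in pairs if truth and vote)          # synthetic, flagged
    fn = sum(1 for truth, vote in pairs if truth and not vote)      # synthetic, missed
    tn = sum(1 for truth, vote in pairs if not truth and not vote)  # real, called real
    fp = sum(1 for truth, vote in pairs if not truth and vote)      # real, called synthetic

    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "accuracy": (tp + tn) / len(pairs) if pairs else float("nan"),
    }


# Hypothetical toy session: 4 synthetic and 6 real cases.
truth = [True, True, True, True, False, False, False, False, False, False]
votes = [True, True, False, False, False, False, False, False, True, False]
metrics = turing_test_metrics(truth, votes)
```

Under this convention, a sensitivity close to 50% means that readers flag roughly half of the synthetic tumors, which is about what guessing between "real" and "synthetic" would give on those cases.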

Paper Structure

This paper contains 16 sections, 11 equations, 18 figures, 19 tables.

Figures (18)

  • Figure 1: Overview of the study. a. We explore tumor synthesis and segmentation on five types of tumors/lesions, i.e., liver tumors, pancreas tumors, kidney tumors, lung tumors, and COVID-19. b. The rapid advancements in medical imaging have enabled the collection of large-scale Computed Tomography (CT) data. However, annotated tumor datasets are scarce due to the extensive annotation burden. c. We curated 161,310 CT volumes from 33 public sources to enable large-scale tumor synthesis and recognition, with merely $2.3\%$ of them comprising annotated tumors. d. FreeTumor consists of two stages: synthesis training and segmentation training. In Stage 1, FreeTumor unleashes the power of large-scale unlabeled data for tumor synthesis training. In Stage 2, FreeTumor synthesizes high-quality tumors on healthy organs, facilitating the integration of large-scale unlabeled data in tumor segmentation training. e. Clinical evaluation of synthetic tumors. We invited 13 board-certified radiologists to a Visual Turing Test to discern between synthetic and real tumors. Rigorous clinician evaluation validates the high quality of our synthetic tumors. f. Extensive segmentation results on 12 public datasets showcase the superiority of FreeTumor. Specifically, FreeTumor adopts SwinUNETR as the segmentation model and employs tumor synthesis to augment segmentation datasets. With large-scale synthetic tumors for training, FreeTumor surpasses the baseline SwinUNETR by significant margins, achieving $10.6\%$, $5.5\%$, $3.8\%$, $6.1\%$, and $7.9\%$ Dice score improvements for the five types of tumors/lesions, respectively. g. Early tumor detection results. With tumor synthesis, FreeTumor yields an average sensitivity improvement of $+16.4\%$.
  • Figure 2: Clinician evaluation of synthetic tumors. We engaged 13 board-certified radiologists in a Visual Turing Test to discern between synthetic and real tumors, as shown in Figure 1 (e). a. Qualitative results of synthetic tumors. The upper row presents healthy organs and the lower row presents synthetic tumors on healthy organs (highlighted by red arrows). b. The sensitivity and specificity results across five types of tumors/lesions. c. The accuracy results across five types of tumors/lesions. d. The average sensitivity and specificity results across five types of tumors/lesions. e. The average accuracy results across five types of tumors/lesions. f. We divide the synthetic tumors into two groups: those that pass the Visual Turing Test and those that fail it. Detailed results are presented in the Extended Data Tables (Turing test sensitivity, accuracy, and averaged results).
  • Figure 3: Comparison with baseline tumor segmentation models. a-l. The 5-fold cross-validation results on 12 public datasets. Specifically, FreeTumor adopts SwinUNETR as the segmentation model. Overall, FreeTumor brings an average Dice score improvement of $+6.7\%$ (two-sided paired t-test $p$-values $<5.09\times10^{-5}$) over the baseline SwinUNETR. m-r. Out-of-domain evaluation. The standard deviations are obtained from five repeated experiments. Specifically, we train the model on a source dataset and conduct direct inference on a target dataset without fine-tuning. For example, in (m), "LiTS to HCC-TACE" represents training a model on the LiTS dataset and conducting inference on the HCC-TACE dataset without fine-tuning. Compared with the baseline SwinUNETR, FreeTumor brings an average Dice score improvement of $+12.3\%$ (two-sided paired t-test $p$-values $<4.42\times10^{-3}$) in 6 out-of-domain experiments. Detailed results are presented in the Extended Data Tables (main results and out-of-domain evaluation).
  • Figure 4: Comparison with tumor synthesis methods and CT foundation models. a-l. The 5-fold cross-validation results on 12 public datasets. SynTumor and DiffTumor are two tumor synthesis methods that use the same segmentation model (SwinUNETR) as FreeTumor; SynTumor is only applicable to liver tumors, and DiffTumor is not applicable to lung tumors and COVID-19. We use a "cross mark" (✘) to signify that a method is not applicable to a dataset. For example, the "cross mark" in (d) means SynTumor is not applicable to the pancreas tumor dataset MSD07. In addition, MAE3D, SwinSSL, and VoCo are three CT foundation models based on self-supervised learning. The same segmentation model (SwinUNETR) is adopted for fair comparison. Overall, on 12 public datasets, FreeTumor surpasses the best-competing method by an average of $5.1\%$ in Dice score (two-sided paired t-test $p$-values $< 3.78 \times 10^{-5}$). m-r. Out-of-domain evaluation. The standard deviations are obtained from five repeated experiments. Overall, in 6 out-of-domain experiments, FreeTumor surpasses the best-competing method by an average of $7.9\%$ in Dice score (two-sided paired t-test $p$-values $<3.73\times10^{-3}$). Detailed results are presented in the Extended Data Tables (comparison with synthesis methods, comparison with self-supervised foundation models, and out-of-domain evaluation).
  • Figure 5: Comprehensive analysis of tumor segmentation performance and data scaling effects. a. The overall Dice score comparisons with baseline tumor segmentation models (U-Net, TransUNet, UNETR, nnU-Net, and SwinUNETR). Significance levels at which FreeTumor outperforms the baseline SwinUNETR (two-sided paired t-test) are ***$p < 1 \times 10^{-3}$ and ****$p < 1 \times 10^{-4}$. The $p$-values for the comparison between FreeTumor and SwinUNETR are: $p < 6.05 \times 10^{-7}$ for liver tumors, $p < 4.02 \times 10^{-7}$ for pancreas tumors, $p < 1.05 \times 10^{-5}$ for kidney tumors, $p < 7.37 \times 10^{-5}$ for lung tumors, and $p < 9.07 \times 10^{-4}$ for COVID-19. b. The average Dice scores of FreeTumor across five types of tumors/lesions. c. Qualitative segmentation results of FreeTumor. The organ segmentation results are presented for better visualization. d-h. The effectiveness of scaling up training datasets. We evaluate the correlation between the scale of the segmentation training datasets and segmentation performance. Specifically, the foundation models (MAE3D, SwinSSL, and VoCo) are unable to utilize unlabeled data in segmentation training, so the scale of their segmentation training datasets is the same as that of the baseline models (U-Net, TransUNet, UNETR, nnU-Net, and SwinUNETR). i. Comparisons between FreeTumor and previous methods (U-Net, TransUNet, UNETR, nnU-Net, SwinUNETR, SynTumor, and DiffTumor) in data utilization. We assess these methods across three dimensions: the scale of training datasets (number of CT volumes), the utilization of unlabeled data in synthesis training, and the utilization of unlabeled data in segmentation training.
  • ...and 13 more figures
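
The figure captions above report Dice score improvements and two-sided paired t-tests over cross-validation folds. As a rough illustration of both quantities, here is a minimal sketch in plain Python. It is not the authors' evaluation code: the function names, the convention that two empty masks score 1.0, and the per-fold numbers are all hypothetical and chosen only for illustration.

```python
import math


def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def paired_t_statistic(scores_a, scores_b):
    """t statistic of a paired t-test on per-fold score differences (a - b)."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 folds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    if variance == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean / math.sqrt(variance / n)


# Toy masks: 3 predicted voxels, 3 ground-truth voxels, 2 in common.
pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 0, 1, 1]
dice = dice_score(pred_mask, true_mask)  # 2 * 2 / (3 + 3)

# Hypothetical 5-fold Dice scores (illustrative values, not from the paper).
with_synthesis = [0.66, 0.69, 0.65, 0.66, 0.67]
baseline = [0.60, 0.62, 0.58, 0.61, 0.59]
t_stat = paired_t_statistic(with_synthesis, baseline)
```

Turning the t statistic into a two-sided $p$-value requires the Student's t distribution with $n-1$ degrees of freedom; a library routine such as `scipy.stats.ttest_rel` does both steps in one call.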