Text-Driven Tumor Synthesis

Xinran Li, Yi Shuai, Chen Liu, Qi Chen, Qilong Wu, Pengfei Guo, Dong Yang, Can Zhao, Pedro R. A. S. Bassi, Daguang Xu, Kang Wang, Yang Yang, Alan Yuille, Zongwei Zhou

TL;DR

TextoMorph introduces a text-conditioned framework for 3D tumor synthesis that leverages radiology reports to control texture, boundaries, heterogeneity, and pathology. By combining a Text-Driven 3D Latent Diffusion Model with text extraction/generation, large-scale contrastive learning, and targeted data augmentation, the method generates diverse, text-consistent tumors while reducing reliance on scarce image–report pairs. Rigorous evaluation, including a Text-Driven Visual Turing Test and Radiomics Pattern Analysis, demonstrates superior realism and texture diversity, and ablations show additive gains in tumor detection, segmentation, and classification. The approach promises practical impact by delivering targeted data augmentation across clinically relevant tasks and is adaptable to demographic diversity and privacy-conscious data synthesis.

Abstract

Tumor synthesis can generate examples that AI often misses or over-detects, improving AI performance by training on these challenging cases. However, existing synthesis methods, which are typically unconditional -- generating images from random variables -- or conditioned only on tumor shapes, lack controllability over specific tumor characteristics such as texture, heterogeneity, boundaries, and pathology type. As a result, the generated tumors may be overly similar to, or duplicates of, existing training data, failing to effectively address AI's weaknesses. We propose a new text-driven tumor synthesis approach, termed TextoMorph, that provides textual control over tumor characteristics. This is particularly beneficial for examples that confuse the AI the most, such as early tumor detection (increasing Sensitivity by +8.5%), tumor segmentation for precise radiotherapy (increasing DSC by +6.3%), and classification between benign and malignant tumors (improving Sensitivity by +8.2%). By incorporating text mined from radiology reports into the synthesis process, we increase the variability and controllability of the synthetic tumors to target AI's failure cases more precisely. Moreover, TextoMorph uses contrastive learning across different texts and CT scans, significantly reducing dependence on scarce image-report pairs (only 141 pairs used in this study) by leveraging a large corpus of 34,035 radiology reports. Finally, we have developed rigorous tests to evaluate synthetic tumors, including a Text-Driven Visual Turing Test and Radiomics Pattern Analysis, showing that our synthetic tumors are realistic and diverse in texture, heterogeneity, boundaries, and pathology.

Paper Structure

This paper contains 28 sections, 4 equations, 12 figures, 11 tables.

Figures (12)

  • Figure 1: Text-Driven Tumor Synthesis. Existing tumor synthesis methods struggle with limited controllability, often generating tumors based solely on predefined shapes or random noise. This results in synthetic data that lacks essential features like texture, boundaries, and attenuation, reducing its effectiveness in addressing AI weaknesses. TextoMorph addresses this limitation by exploiting a dataset of 34,176 radiology reports to generate tumors with medically precise features described in clinical language. Examples include phrases such as 'hypodensity', 'ill-defined', and 'cystic', paired with CT scans of the liver, pancreas, and kidney.
  • Figure 2: Overview of the TextoMorph Framework. The framework consists of four steps: (1) Given a radiology report, we first perform text extraction and generation to obtain descriptive phrases (e.g., conglomerate metastasis with mixed interval response). These phrases are encoded by a text encoder (implemented via CLIP) to produce language representations guiding tumor synthesis. (2) Based on textual information and latent CT features, we train a Text-Driven 3D Diffusion Model with Encoder ($E$) and Decoder ($D$) to generate high-fidelity synthetic tumors consistent with the report descriptions. (3) Contrastive learning operations (Push vs. Pull) ensure that reports with consistent descriptive words ($R_0$ vs. $R_{0}^{\prime}$) generate similar tumors from different CTs, while distinct reports ($R_0$ vs. $R_{1}$) yield distinguishable tumor features. (4) To enhance AI performance in detection, segmentation, and classification tasks, we extract descriptive texts from false positive samples to generate similar tumor examples, thereby improving the model's recognition of complex lesions.
  • Figure 3: Text-Driven Contrastive Learning. We illustrate the contrastive learning approach in the diffusion model for tumor synthesis control. The negative pair shows that different descriptive words (e.g., 'hypoattenuating' vs. 'heterogeneous') applied to the same CT scan generate distinct tumors, enforcing that different descriptions lead to distinguishable features. The positive pair shows the same descriptive words (e.g., 'hypoattenuating' vs. 'hypoattenuating') applied to two different CT scans, resulting in similar tumor features, thus ensuring consistency for identical descriptions across varying CT contexts. This strategy aligns the textual descriptions with tumor synthesis, promoting both distinctiveness and consistency.
  • Figure 4: Tumor Detection and Segmentation. Comparison of the performance of different tumor generation models in a radial plot with an outer ring value of 90. The models include TextoMorph (full) and its variants excluding Text Extraction and Generation (No Text E-G), Text-Driven Contrastive Learning (No Contrastive Loss), and Targeted Data Augmentation (No T-D-A), along with DiffTumor and RealTumor. Performance metrics include sensitivity for small ($d < 20\,\mathrm{mm}$), medium ($20 \leq d < 50\,\mathrm{mm}$), and large ($d \geq 50\,\mathrm{mm}$) tumors, Dice Similarity Coefficient (DSC), and Normalized Surface Distance (NSD). Each configuration uses distinct colors or line styles to highlight the impact of individual components. See the supplementary ablation section (\ref{sec:supp_Ablation_overall}) for tabular results.
  • Figure 5: Generalizable Across Different Patient Demographics. TextoMorph demonstrates consistent performance improvements in detecting benign tumors (pancreatic cysts) in both tumor-wise Sensitivity (%) and segmentation DSC (%) across various patient groups. Results of detecting malignant tumors in the pancreas (e.g., PDAC) can be found in the generalizability section (\ref{sec:Generalizable}).
  • ...and 7 more figures
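
The push/pull idea described in the captions of Figures 2 and 3 can be illustrated with a small numerical sketch. It is not the loss used by TextoMorph, whose exact formulation is not given in this section. It assumes a generic margin-based contrastive objective on cosine distance, and the names `cosine_distance`, `push_pull_loss`, and `margin` are hypothetical. The anchor and positive stand for tumor features generated from the same description on two different CT scans (pull), while the negative stands for features generated from a different description on the same CT scan (push).

```python
import numpy as np


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def push_pull_loss(anchor, positive, negative, margin=0.5):
    """Generic margin-based contrastive loss (an assumed stand-in, not the paper's loss).

    pull: same description, different CT scans -> features should be close.
    push: different descriptions, same CT scan -> features should be at
          least `margin` apart in cosine distance.
    """
    pull = cosine_distance(anchor, positive)
    push = max(0.0, margin - cosine_distance(anchor, negative))
    return pull + push


# Toy feature vectors (hypothetical values, for illustration only).
well_aligned = push_pull_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
misaligned = push_pull_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

In this toy setup the loss is zero when the positive pair coincides and the negative pair is orthogonal, and it grows when the roles are swapped, which matches the "consistency for identical descriptions, distinctiveness for different ones" behaviour the captions describe.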