
Boosting Medical Image-based Cancer Detection via Text-guided Supervision from Reports

Guangyu Guo, Jiawen Yao, Yingda Xia, Tony C. W. Mok, Zhilin Zheng, Junwei Han, Le Lu, Dingwen Zhang, Jian Zhou, Ling Zhang

TL;DR

The paper tackles the annotation bottleneck in CT-based cancer screening by leveraging clinical reports as weak supervision via a text-guided weakly semi-supervised framework that embeds diagnosis and tumor location prompts into a vision-language model. It introduces a two-step training pipeline: a segmentation teacher that generates pseudo masks and a cancer-detection student that learns from both full and pseudo annotations, with text-guided losses operating in the CLIP latent space to stabilize learning. Evaluated on a large esophageal cancer dataset of 1,651 patients, the method achieves AUC 0.961 using only 30% fully annotated data, closely matching the 0.966 of fully supervised models and reducing annotation effort by at least 70%. Ablation studies confirm the value of text-guided cues and prompt design, demonstrating improved pseudo masks, robust joint segmentation-detection performance, and competitive results against strong baselines. The approach offers a scalable path for text-informed, low-label cancer screening applicable to multiple organ sites and imaging modalities.
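
The two-step data flow described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the helper names (`split_cases`, `train_teacher`, `pseudo_label`, `build_student_training_set`) are hypothetical, images are flattened lists of intensities, and the "teacher" is a toy intensity threshold standing in for a real segmentation network.

```python
from typing import Callable, List, Optional, Tuple

# A case is (image, voxel_mask_or_None, report_label). All names are hypothetical.
Case = Tuple[List[float], Optional[List[int]], int]


def split_cases(cases: List[Case]) -> Tuple[List[Case], List[Case]]:
    """Separate fully annotated cases from cases that only have a report label."""
    full = [c for c in cases if c[1] is not None]
    weak = [c for c in cases if c[1] is None]
    return full, weak


def train_teacher(full_cases: List[Case]) -> Callable[[List[float]], List[int]]:
    """Step 1 stand-in: fit an intensity threshold on annotated voxels.

    Assumes the annotated cases contain both tumor and background voxels.
    A real teacher would be a 3D segmentation network.
    """
    tumor = [v for img, mask, _ in full_cases for v, m in zip(img, mask) if m == 1]
    background = [v for img, mask, _ in full_cases for v, m in zip(img, mask) if m == 0]
    threshold = (sum(tumor) / len(tumor) + sum(background) / len(background)) / 2.0
    return lambda image: [1 if v > threshold else 0 for v in image]


def pseudo_label(teacher: Callable[[List[float]], List[int]],
                 weak_cases: List[Case]) -> List[Case]:
    """Attach teacher-predicted pseudo masks to report-only cases."""
    return [(img, teacher(img), label) for img, _, label in weak_cases]


def build_student_training_set(cases: List[Case]) -> List[Case]:
    """Step 2 input: expert masks plus pseudo masks for every remaining case."""
    full, weak = split_cases(cases)
    teacher = train_teacher(full)
    return full + pseudo_label(teacher, weak)


toy_cases: List[Case] = [
    ([0.1, 0.9, 0.8, 0.2], [0, 1, 1, 0], 1),  # fully annotated, cancer
    ([0.1, 0.2, 0.1, 0.3], [0, 0, 0, 0], 0),  # fully annotated, normal
    ([0.2, 0.7, 0.1, 0.1], None, 1),          # report only, cancer
    ([0.1, 0.1, 0.2, 0.3], None, 0),          # report only, normal
]
student_set = build_student_training_set(toy_cases)
```

In this sketch the student sees every case with some mask: expert masks where they exist and teacher-generated pseudo masks elsewhere, with the report label carried along as the weak label.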

Abstract

The lack of sufficient expert-level tumor annotations hinders the effectiveness of supervised learning-based opportunistic cancer screening on medical imaging. Clinical reports, which are rich in descriptive textual details, can offer "free lunch" supervision and provide tumor location as a type of weak label for screening tasks, thus saving human labeling workload if properly leveraged. However, predicting cancer using only such weak labels can be very challenging, since tumors usually occupy small anatomical regions relative to the whole 3D medical scan. Weakly semi-supervised learning (WSSL) combines a limited set of voxel-level tumor annotations with a substantial number of medical images that have only off-the-shelf clinical reports, which may strike a good balance between minimizing expert annotation workload and optimizing screening efficacy. In this paper, we propose a novel text-guided learning method to achieve highly accurate cancer detection results. By integrating diagnostic and tumor location text prompts into the text encoder of a vision-language model (VLM), weakly supervised learning can be optimized effectively in the latent space of the VLM, thereby enhancing the stability of training. Our approach can leverage the clinical knowledge of a large-scale pre-trained VLM to enhance generalization ability and produce reliable pseudo tumor masks to improve cancer detection. Our extensive quantitative experimental results on a large-scale cancer dataset of 1,651 unique patients validate that our approach can reduce human annotation effort by at least 70% while maintaining cancer detection accuracy comparable to competing fully supervised methods (AUC 0.961 versus 0.966).
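
One way to picture optimization "in the latent space of the VLM" is a CLIP-style objective: compare an image embedding with the embeddings of candidate text prompts and apply cross-entropy against the label taken from the report. The sketch below assumes that form. The prompt wording, the temperature of 0.07, and the two-dimensional toy embeddings are all illustrative assumptions and are not taken from the paper.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def text_guided_loss(image_emb: Sequence[float],
                     prompt_embs: List[Sequence[float]],
                     target: int,
                     temperature: float = 0.07) -> float:
    """Cross-entropy over temperature-scaled image-to-prompt similarities.

    prompt_embs[i] is the text embedding of the i-th candidate prompt;
    target is the index of the prompt that matches the clinical report.
    """
    logits = [cosine(image_emb, p) / temperature for p in prompt_embs]
    # Numerically stable log-sum-exp.
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum_exp - logits[target]


# Hypothetical prompts; real embeddings would come from a VLM text encoder.
prompts = ["a CT scan with no tumor", "a CT scan with a tumor in the esophagus"]
prompt_embs = [[1.0, 0.0], [0.0, 1.0]]
image_emb = [0.9, 0.1]  # toy image embedding, closest to the first prompt

loss_if_normal = text_guided_loss(image_emb, prompt_embs, target=0)
loss_if_tumor = text_guided_loss(image_emb, prompt_embs, target=1)
```

Because the toy image embedding lies closer to the first prompt, the loss is lower when the report label is index 0 than when it is index 1, which is the signal that would push image features toward the matching text embedding.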

Paper Structure (13 sections, 7 equations, 10 figures, 8 tables)

Figures (10)

  • Figure 1: Illustration of the different types of annotations or labels available for non-contrast CT-based esophageal cancer (EC) screening. Tumor annotation can be labor-intensive when it involves carefully labeling the tumor mask on each axial CT slice. In contrast, clinical reports offer readily accessible patient-level and location-specific information, potentially eliminating, at least partially, the need for time-consuming expert annotation.
  • Figure 2: Overall framework of our text-guided weakly semi-supervised cancer detection method. Step 1: Train a tumor segmentation teacher network on a subset of fully annotated images (e.g., 30%) to learn detailed cancer characteristics, then use this network to create pseudo tumor masks for the remaining weakly supervised images (e.g., 70%) that have only clinical reports. Step 2: Use all training images to train a cancer detection student network. Notably, our text-guided learning scheme, which utilizes weak labels, serves a dual purpose: it mitigates the overfitting associated with limited training data in Step 1, and it compensates for the imprecision of pseudo tumor masks in Step 2.
  • Figure 3: Illustration of our proposed text-guided cancer detection framework. (a) A joint learning framework for tumor segmentation and cancer detection: multi-scale features from the decoder of a segmentation network are aggregated and fed to a classification head to obtain the final screening result. (b) A text-guided learning scheme that mines weak labels from the text of clinical records, including diagnostic labels and tumor location labels.
  • Figure 4: ROC curves of fully-supervised and WSSL model settings.
  • Figure 5: Ablation study on the Step 1 tumor segmentation teacher model. We report Dice scores for tumors in the esophagus (ESO) and the esophagogastric junction (EGJ), respectively.
  • ...and 5 more figures
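
The joint design in Figure 3(a), where multi-scale decoder features are aggregated and passed to a classification head, could look roughly like the sketch below. The caption does not say how the features are aggregated, so global average pooling followed by concatenation and a single logistic layer is an assumption made here for illustration; the function names, weights, and toy feature maps are hypothetical.

```python
import math
from typing import List

# One feature map per decoder scale: channels x voxels (spatial dims flattened).
FeatureMap = List[List[float]]


def global_avg_pool(fmap: FeatureMap) -> List[float]:
    """Reduce each channel to its mean, giving one value per channel."""
    return [sum(channel) / len(channel) for channel in fmap]


def aggregate(scales: List[FeatureMap]) -> List[float]:
    """Pool every scale, then concatenate into a single feature vector."""
    pooled: List[float] = []
    for fmap in scales:
        pooled.extend(global_avg_pool(fmap))
    return pooled


def classify(features: List[float], weights: List[float], bias: float) -> float:
    """Logistic classification head: returns a cancer probability in (0, 1)."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Two toy scales: a coarse map (2 channels x 2 voxels) and a fine map (1 channel x 4 voxels).
scales = [
    [[1.0, 3.0], [0.0, 0.0]],
    [[2.0, 2.0, 2.0, 2.0]],
]
features = aggregate(scales)  # one pooled value per channel across both scales
probability = classify(features, weights=[0.5, 0.0, -0.5], bias=0.0)
```

Pooling before concatenation lets scales with different spatial resolutions share one classification head, since each scale contributes a fixed-length vector regardless of its voxel count.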