CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios

Jingyang Lin, Yingda Xia, Jianpeng Zhang, Ke Yan, Kai Cao, Le Lu, Jiebo Luo, Ling Zhang

TL;DR

This work tackles the difficulty of learning from 3D CT scans with radiology reports by introducing CT-GLIP, a grounded language–image pretraining framework that aligns organ-level visual regions with textual descriptions. By constructing a large grounded CT–report dataset and combining anatomy and diagnosis contrastive learning with an abnormality dictionary, CT-GLIP achieves strong zero-shot performance in organ recognition and abnormality detection and improves fine-tuning results for tumor detection and segmentation. The results demonstrate the importance of grounded cross-modal alignment in 3D medical VL foundation models, enabling more precise and generalizable representations for full-body CT analysis. The framework offers a practical path toward automated, region-aware radiology interpretation and lays groundwork for extending grounded 3D VL pretraining to other modalities.

Abstract

3D medical vision-language (VL) pretraining has shown potential in radiology by leveraging large-scale multimodal datasets with CT-report pairs. However, existing methods primarily rely on a global VL alignment directly adapted from 2D scenarios. The entire 3D image is transformed into one global embedding, resulting in a loss of sparse but critical semantics essential for accurately aligning with the corresponding diagnosis. To address this limitation, we propose CT-GLIP, a 3D Grounded Language-Image Pretrained model that constructs fine-grained CT-report pairs to enhance grounded cross-modal contrastive learning, effectively aligning grounded visual features with precise textual descriptions. Leveraging the grounded cross-modal alignment, CT-GLIP improves performance across diverse downstream tasks and can even identify organs and abnormalities in a zero-shot manner using natural language. CT-GLIP is trained on a multimodal CT dataset comprising 44,011 organ-level CT-report pairs from 17,702 patients, covering 104 organs. Evaluation is conducted on four downstream tasks: zero-shot organ recognition (OR), zero-shot abnormality detection (AD), tumor detection (TD), and tumor segmentation (TS). Empirical results show that CT-GLIP outperforms its counterparts with global VL alignment. Compared to vanilla CLIP, CT-GLIP achieves average performance improvements of 15.1% in F1 score, 1.9% in AUC, and 3.2% in DSC for the zero-shot AD, TD, and TS tasks, respectively. This study highlights the significance of grounded VL alignment in enabling 3D medical VL foundation models to understand sparse representations within CT scans.
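To make the idea of cross-modal contrastive learning over organ-level pairs more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a CLIP-style symmetric InfoNCE objective with a fixed temperature of 0.07; the function names, the temperature value, and the two-dimensional toy embeddings are all hypothetical and chosen only for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(visual, text, temperature=0.07):
    """CLIP-style loss over N matched (organ embedding, text embedding) pairs.

    The temperature is an assumed hyperparameter, not a value from the paper.
    """
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in text]
    # logits[i][j]: scaled cosine similarity between organ i and description j.
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
              for vi in v]
    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))

# Toy example: three organs with hypothetical 2-D embeddings.
organ_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = symmetric_contrastive_loss(
    organ_embeddings, [[1.0, 0.1], [0.1, 1.0], [-1.0, 0.1]])
shuffled = symmetric_contrastive_loss(
    organ_embeddings, [[0.1, 1.0], [-1.0, 0.1], [1.0, 0.1]])
print(aligned < shuffled)  # matched pairs should give the lower loss
```

In this toy setting the loss is low when each organ embedding sits next to its own description and high when the descriptions are shuffled, which is the behavior a grounded alignment objective is meant to reward.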

Paper Structure (21 sections, 7 equations, 5 figures, 2 tables)


Figures (5)

  • Figure 1: Overview of our grounded CT-report multimodal dataset, the CT-GLIP framework, and evaluation protocols. a. The CT-GLIP framework is trained on 44,011 grounded CT-report pairs from 17,702 patients, covering 104 organs. This fine-grained multimodal dataset enables precise and effective contrastive learning for analyzing CT scans. b. Zero-shot evaluation includes 104-way organ recognition and abnormality detection, conducted on test sets comprising 643 and 1,330 patients, respectively, and targeting 16 common abnormalities across 7 organs. c. Fine-tuning evaluation focuses on tumor detection and segmentation tasks.
  • Figure 2: Comprehensive statistics of our grounded CT-report multimodal dataset. a. Gender distribution, with 62.3% male and 37.7% female patients. b. Age distribution depicted as a histogram with ten-year bins and an overlaid Gaussian kernel density estimate, indicating a broad adult cohort with an average age of approximately 50 years. c. Word count distribution of the raw radiology reports. d. Anatomic region frequencies in log scale. e. Word cloud of the raw reports, where font size reflects term frequency. f. Organ frequencies (log scale) in the training set. g. Organ frequencies (log scale) in the test set. h. Word counts of the structured abnormality diagnoses for the training (blue) and test (red) sets, which show similar distributions. i. Word clouds of the abnormality diagnoses for the training (left) and test (right) sets.
  • Figure 3: Overview of the CT-GLIP framework, consisting of anatomy and diagnosis contrastive learning. a. Anatomy contrastive learning obtains grounded CT embeddings via organ-level pooling with pseudo masks. Then, it pairs each organ with a template description encoded by an expert text encoder to learn anatomical VL alignment. b. Diagnosis contrastive learning builds diagnosis descriptions for each organ (real findings for abnormal, templated "no evident abnormality" for normal). Moreover, an additional abnormality dictionary increases the diversity of abnormality descriptions, expanding contrastive pairs and improving discrimination between abnormality and non-abnormality.
  • Figure 4: Zero-shot Evaluation. a. At zero-shot organ recognition inference, 104 organ descriptions are templated and encoded by an expert text encoder. Meanwhile, organ-level visual embeddings are extracted from grounded CT scans by a 3D encoder. Then, the text embedding nearest to a given grounded visual embedding determines the prediction, enabling 104-way zero-shot classification. b. Top-1 accuracy with CNN and ViT backbones shows that CT-GLIP achieves strong performance on zero-shot organ recognition, whereas the global VL-alignment baseline (vanilla CLIP) struggles on this task. c. Zero-shot abnormality detection inference pipeline: organ-level features are contrasted against abnormality and non-abnormality prompts to determine the abnormality label. d-g. Zero-shot abnormality detection across CNN- and ViT-based encoders evaluated by F1, Positive Predictive Value, Sensitivity, and AUC: CT-GLIP and its variants († and ‡) improve over vanilla CLIP, with the full CT-GLIP achieving the strongest results. h. Per-organ AUC radar plot highlights consistent improvements of CT-GLIP over vanilla CLIP across all seven organs.
  • Figure 5: Fine-tuning Evaluation. a. Per-organ tumor detection is evaluated with a CNN backbone (nnUNet) using AUC. Both pretraining strategies surpass training from scratch, and CT-GLIP outperforms vanilla CLIP on most organs. b. Using a ViT backbone (MiT), we obtain the same pattern, highlighting backbone-agnostic benefits from pretraining and the stronger encoder learned by CT-GLIP. c. Overall performance on tumor detection with both CNN and ViT backbones confirms that pretraining provides clear improvements, and grounded VL alignment outperforms global VL alignment. d. Per-organ tumor segmentation with nnUNet, evaluated by Dice-Sørensen Coefficient (DSC), shows improvements from pretraining, and CT-GLIP provides additional gains beyond vanilla CLIP. e. Per-organ performance on tumor segmentation with MiT shows that CT-GLIP yields broad improvements on most organs over both the Scratch and vanilla CLIP baselines. f. Overall performance on tumor segmentation using both CNN and ViT backbones demonstrates the superiority of CT-GLIP.
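The organ-level pooling in Figure 3a and the nearest-text prediction in Figure 4a can be illustrated with a small sketch. This is a simplified illustration under stated assumptions, not the paper's code: it assumes mean pooling over the voxels of each organ, treats label 0 as background, uses cosine similarity for the nearest-text lookup, and replaces the 3D encoder and the expert text encoder with hand-written two-dimensional vectors. All function names and values are hypothetical.

```python
import math

def organ_pool(voxel_features, voxel_labels):
    """Average the feature vectors of all voxels that share an organ label.

    voxel_features: one feature vector per voxel (flattened volume).
    voxel_labels:   integer organ ids from a (pseudo) segmentation mask;
                    0 is treated as background and skipped (an assumption).
    """
    sums, counts = {}, {}
    for feat, label in zip(voxel_features, voxel_labels):
        if label == 0:
            continue
        acc = sums.setdefault(label, [0.0] * len(feat))
        for k, x in enumerate(feat):
            acc[k] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc]
            for label, acc in sums.items()}

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def zero_shot_label(organ_embedding, text_embeddings):
    """Return the prompt name whose embedding is closest to the organ embedding."""
    return max(text_embeddings,
               key=lambda name: cosine(organ_embedding, text_embeddings[name]))

# Toy volume: five voxels with 2-D features; the last voxel is background.
features = [[0.9, 0.1], [1.1, -0.1], [0.0, 1.0], [0.2, 0.8], [5.0, 5.0]]
labels = [1, 1, 2, 2, 0]
pooled = organ_pool(features, labels)

# Stand-ins for encoded organ prompts (hypothetical values).
prompts = {"liver": [1.0, 0.0], "spleen": [0.0, 1.0]}
predictions = {organ_id: zero_shot_label(emb, prompts)
               for organ_id, emb in pooled.items()}
print(predictions)
```

The same nearest-prompt lookup could in principle serve the abnormality detection setting in Figure 4c by swapping the organ prompts for one abnormality prompt and one "no evident abnormality" prompt per organ, although the exact scoring rule used in the paper is not specified here.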