CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios
Jingyang Lin, Yingda Xia, Jianpeng Zhang, Ke Yan, Kai Cao, Le Lu, Jiebo Luo, Ling Zhang
TL;DR
This work tackles the difficulty of learning from 3D CT scans with radiology reports by introducing CT-GLIP, a grounded language–image pretraining framework that aligns organ-level visual regions with textual descriptions. By constructing a large grounded CT–report dataset and employing anatomy and diagnosis contrastive learning plus an abnormality dictionary, CT-GLIP achieves strong zero-shot performance in organ recognition and abnormality detection and improves fine-tuning results for tumor detection and segmentation. The results demonstrate the importance of grounded cross-modal alignment in 3D medical vision-language (VL) foundation models, enabling more precise and generalizable representations for full-body CT analysis. The framework offers a practical path toward automated, region-aware radiology interpretation and lays groundwork for extending grounded 3D VL pretraining to other modalities.
Abstract
3D medical vision-language (VL) pretraining has shown potential in radiology by leveraging large-scale multimodal datasets with CT-report pairs. However, existing methods primarily rely on a global VL alignment directly adapted from 2D scenarios. The entire 3D image is transformed into one global embedding, resulting in a loss of sparse but critical semantics essential for accurately aligning with the corresponding diagnosis. To address this limitation, we propose CT-GLIP, a 3D Grounded Language-Image Pretrained model that constructs fine-grained CT-report pairs to enhance grounded cross-modal contrastive learning, effectively aligning grounded visual features with precise textual descriptions. Leveraging the grounded cross-modal alignment, CT-GLIP improves performance across diverse downstream tasks and can even identify organs and abnormalities in a zero-shot manner using natural language. CT-GLIP is trained on a multimodal CT dataset comprising 44,011 organ-level CT-report pairs from 17,702 patients, covering 104 organs. Evaluation is conducted on four downstream tasks: zero-shot organ recognition (OR), zero-shot abnormality detection (AD), tumor detection (TD), and tumor segmentation (TS). Empirical results show that CT-GLIP outperforms its counterparts with global VL alignment. Compared to vanilla CLIP, CT-GLIP achieves average performance improvements of 15.1% in F1 score, 1.9% in AUC, and 3.2% in DSC for the zero-shot AD, TD, and TS tasks, respectively. This study highlights the significance of grounded VL alignment in enabling 3D medical VL foundation models to understand sparse representations within CT scans.
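To make the contrast between global and grounded alignment concrete, the following is a minimal sketch and not the authors' implementation. It assumes that an organ-level visual embedding can be obtained by averaging voxel features inside an organ mask, and that matched organ and text embeddings are aligned with a CLIP-style symmetric contrastive loss. The function names, the pooling choice, and the temperature value are all hypothetical; the abstract does not specify these details.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def masked_average_pool(voxel_features, organ_mask):
    """Average the feature vectors of the voxels inside one organ mask.

    Assumption: this pooling stands in for whatever grounding mechanism
    the paper uses to obtain organ-level visual features.
    """
    selected = [f for f, inside in zip(voxel_features, organ_mask) if inside]
    if not selected:
        raise ValueError("organ mask is empty")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def symmetric_contrastive_loss(visual, text, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss; visual[i] is paired with text[i].

    Assumption: the temperature of 0.07 is illustrative only.
    """
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in text]
    n = len(v)
    # Cosine similarities scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        # Mean cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    text_to_visual = [list(col) for col in zip(*logits)]
    return 0.5 * (
        cross_entropy_on_diagonal(logits)
        + cross_entropy_on_diagonal(text_to_visual)
    )
```

Under these assumptions, a global approach would pool over the whole volume and compare one embedding per scan with one per report, whereas the grounded variant calls `masked_average_pool` once per organ and contrasts each organ embedding with the text describing that organ.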
