T3D: Advancing 3D Medical Vision-Language Pre-training by Learning Multi-View Visual Consistency
Che Liu, Cheng Ouyang, Yinda Chen, César Quilodrán-Casas, Lei Ma, Jie Fu, Yike Guo, Anand Shah, Wenjia Bai, Rossella Arcucci
TL;DR
This work addresses the limited supervision available for 3D medical imaging by introducing CT-3DVLP, the first and largest public 3D volume-report dataset, and the T3D framework, which couples global cross-modal alignment with a text-informed multi-view strategy. By learning both global representations and text-guided local features from multiple crops per volume, T3D surpasses prior 3D MedVLP and vSSL methods across classification, retrieval, report generation, and segmentation tasks. The approach shows strong zero-shot and fine-tuning performance, indicating improved generalization and clinical relevance, while the public dataset enables open research and benchmarking.
Abstract
While 3D visual self-supervised learning (vSSL) shows promising results in capturing visual representations, it overlooks the clinical knowledge contained in radiology reports. Meanwhile, 3D medical vision-language pre-training (MedVLP) remains underexplored due to the lack of a large-scale, publicly available 3D medical image-report dataset. To bridge this gap, we introduce **CT-3DVLP**, the first and largest **public** 3D volume-report dataset, establishing a comprehensive benchmark for 3D MedVLP research. We also propose the **T3D** framework, which goes beyond naive CLIP-style alignment; such alignment directly pairs volumes with reports but neglects local visual representations. To address this, we introduce **Text-informed Multi-view Alignment (TMA)**, a novel approach that clusters volumetric data while enforcing consistency across different views of the same volume-report pair. TMA integrates textual features into fine-grained visual representations, ensuring contextual coherence across views. We evaluate T3D on multiple downstream tasks in both unimodal and cross-modal settings, including zero-shot and fine-tuned classification, cross-modal retrieval, report generation, and semantic segmentation. Our results show that T3D consistently outperforms existing vSSL and multimodal methods, demonstrating superior zero-shot and fine-tuning capabilities and setting a new benchmark for 3D medical image understanding.
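The abstract names two ingredients: CLIP-style global alignment between volumes and reports, and consistency across views of the same volume. The following is a minimal sketch of what such terms could look like, not the authors' TMA, which additionally clusters volumetric data and integrates textual features into the visual representations. The function names (`clip_loss`, `view_consistency`), the temperature of 0.07, and the cosine-based consistency measure are all assumptions chosen for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def clip_loss(vol_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    vol_embs[i] and txt_embs[i] are assumed to come from the same
    volume-report pair; every other pairing in the batch is a negative.
    """
    n = len(vol_embs)
    logits = [[cosine(v, t) / temperature for t in txt_embs] for v in vol_embs]

    def cross_entropy_diag(rows):
        # Cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    cols = [list(c) for c in zip(*logits)]  # text-to-volume direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(cols))


def view_consistency(view_a, view_b):
    """Mean (1 - cosine) between embeddings of two views of the same volumes."""
    return sum(1.0 - cosine(a, b) for a, b in zip(view_a, view_b)) / len(view_a)
```

In this toy formulation, a training objective might sum the two terms with a hypothetical weight, for example `clip_loss(v, t) + 0.5 * view_consistency(v1, v2)`; the actual weighting and the form of the local, text-informed term are not specified in the abstract.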
