T3D: Advancing 3D Medical Vision-Language Pre-training by Learning Multi-View Visual Consistency

Che Liu; Cheng Ouyang; Yinda Chen; César Quilodrán-Casas; Lei Ma; Jie Fu; Yike Guo; Anand Shah; Wenjia Bai; Rossella Arcucci

TL;DR

This work addresses the limited supervision for 3D medical imaging by introducing CT-3DVLP, the first large public volume-report dataset, and the T3D framework that couples global cross-modal alignment with a text-informed multi-view strategy. By learning both global representations and text-guided local features from multiple crops per volume, T3D surpasses prior 3D MedVLP and vSSL methods across classification, retrieval, report generation, and segmentation tasks. The approach demonstrates strong zero-shot and fine-tuning performance, indicating improved generalization and clinical relevance, while the public dataset enables open research and benchmarking. Overall, CT-3DVLP and T3D set a new benchmark for 3D medical image understanding and multimodal learning in healthcare.

Abstract

While 3D visual self-supervised learning (vSSL) shows promising results in capturing visual representations, it overlooks the clinical knowledge in radiology reports. Meanwhile, 3D medical vision-language pre-training (MedVLP) remains underexplored due to the lack of a large-scale, publicly available 3D medical image-report dataset. To bridge this gap, we introduce **CT-3DVLP**, the first and largest **public** 3D volume-report dataset, establishing a comprehensive benchmark for 3D MedVLP research. We also propose the **T3D** framework, which moves beyond naive CLIP-style alignment, where volumes are paired directly with reports and local visual representations are neglected. To this end, we introduce **Text-informed Multi-view Alignment (TMA)**, a novel approach that clusters volumetric data while enforcing consistency across different views of the same volume-report pair. TMA integrates textual features into fine-grained visual representations, ensuring contextual coherence across views. We evaluate T3D on multiple downstream tasks in both unimodal and cross-modal settings, including zero-shot and fine-tuned classification, cross-modal retrieval, report generation, and semantic segmentation. Our results show that T3D consistently outperforms existing vSSL and multimodal methods in both zero-shot and fine-tuning settings, setting a new benchmark for 3D medical image understanding.
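To make the "CLIP-style alignment" mentioned in the abstract concrete, the following is a minimal NumPy sketch of a symmetric contrastive (InfoNCE) loss over a batch of paired volume and report embeddings. It is an illustration under stated assumptions, not the paper's implementation: the function name `clip_style_loss`, the temperature value of 0.07, and the random stand-in embeddings are all hypothetical, and the encoders that would produce real embeddings are omitted.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(z_v, z_r, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    z_v: (B, D) volume embeddings; z_r: (B, D) report embeddings.
    Row i of z_v is assumed to be paired with row i of z_r.
    The temperature is an illustrative default, not a value from the paper.
    """
    z_v, z_r = l2_normalize(z_v), l2_normalize(z_r)
    logits = z_v @ z_r.T / temperature          # (B, B) scaled cosine similarities
    idx = np.arange(len(z_v))                   # positives sit on the diagonal
    loss_v2r = -log_softmax(logits, axis=1)[idx, idx].mean()  # volume -> report
    loss_r2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # report -> volume
    return 0.5 * (loss_v2r + loss_r2v)

# Usage with random stand-in embeddings (batch of 4, dimension 8).
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
aligned = clip_style_loss(z, z)            # every pair matches exactly
misaligned = clip_style_loss(z, z[::-1])   # pairs deliberately shuffled
print(aligned < misaligned)
```

The loss only compares one global embedding per volume with one per report, which is why an objective of this kind says nothing about local regions inside the volume.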

Paper Structure (24 sections, 15 equations, 5 figures, 5 tables)

Introduction
Related Work
Method
Extracting Visual and Text Features
Global Cross-Modal Alignment
Text-Informed Multi-View Alignment
Motivation.
Generating Local Views.
Text-Informed Local Feature Enhancement.
Multi-View Alignment.
Overall Objective
Experiments
Pre-training Dataset.
Pre-training Implementation.
Downstream Tasks Configuration
...and 9 more sections

Figures (5)

Figure 1: Illustration of the Text-Informed Multi-View Alignment (TMA) method. Multiple local views $V_i^m$ are generated from the same 3D volume, and their embeddings are aligned in the latent space to encourage consistency across views from the same volume-report pair. Each view's embedding is refined by the corresponding report to ensure consistency among all views from the same volume. Details are given in the Text-Informed Multi-View Alignment section.
Figure 2: The T3D framework for learning multi-level 3D visual representations from corresponding medical reports. Left: To learn global cross-modal representations, we align the full 3D volume $V_i$ with its corresponding medical report $R_i$ using the loss function $\mathcal{L}_{\textrm{GCA}}$. The output embeddings $\mathbf{z}_i^v$ and $\mathbf{z}_i^r$ are optimized to encourage the matching of paired visual and textual features. Right: To further capture fine-grained visual representations, we first generate $M$ local views $V_i^m$ from the same volume using random cropping. The same visual encoder, as used in the GCA framework, is applied to obtain the embeddings for these local views. We then refine these embeddings using the report embedding $\mathbf{T}_i$, encouraging the local views from the same volume-report pair to become more similar in the latent space by minimizing the loss $\mathcal{L}_{\textrm{TMA}}$.
Figure 3: Comparison of T3D (Ours) and CT-CLIP [ct-rate] across six tasks, showing AUC, Dice, RaTEScore, and R@1 for varying pre-training data scales from 10k to the full dataset. T3D consistently outperforms CT-CLIP across all data scales and tasks, particularly with larger datasets.
Figure 4: Report generation results of Merlin [blankemeier2024merlin] and T3D (Ours). Text highlighted in the same color indicates correct predictions, while bold and underlined text marks incorrect parts. Merlin shows incorrect patterns in various areas, whereas T3D provides more accurate results, particularly in the detection of lymph nodes and other pathologies.
Figure 5: Performance of T3D pre-trained on the proposed CT-3DVLP dataset across six tasks, with varying model scales: ResNet18, ResNet34, and ResNet50. The results show consistent performance improvement as the model scale increases.
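The local-view step described in the Figure 2 caption (generating $M$ views $V_i^m$ from one volume by random cropping) can be sketched as below. This is a hedged illustration only: the function names `random_local_views` and `mean_pairwise_cosine`, the crop size, and the use of mean pairwise cosine similarity as a consistency measure are assumptions made for this example. The actual form of $\mathcal{L}_{\textrm{TMA}}$ and the text-guided refinement are not given in this excerpt and are not reproduced here.

```python
import numpy as np

def random_local_views(volume, crop_shape, num_views, rng):
    """Draw `num_views` random sub-volumes of size `crop_shape` from a 3D volume.

    volume: array of shape (depth, height, width).
    Returns an array of shape (num_views, *crop_shape).
    """
    if any(c > s for c, s in zip(crop_shape, volume.shape)):
        raise ValueError("crop_shape must fit inside the volume")
    cd, ch, cw = crop_shape
    depth, height, width = volume.shape
    views = []
    for _ in range(num_views):
        # Upper bound is exclusive, so +1 allows a crop flush with the far edge.
        d0 = rng.integers(0, depth - cd + 1)
        h0 = rng.integers(0, height - ch + 1)
        w0 = rng.integers(0, width - cw + 1)
        views.append(volume[d0:d0 + cd, h0:h0 + ch, w0:w0 + cw])
    return np.stack(views)

def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct pairs of view embeddings.

    A stand-in consistency measure (assumption): higher values mean the views
    of one volume lie closer together in the latent space.
    """
    z = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-8)
    sim = z @ z.T
    off_diagonal = ~np.eye(len(z), dtype=bool)   # drop self-similarities
    return sim[off_diagonal].mean()

# Usage: 4 local views of size 16x32x32 from a synthetic 32x64x64 volume.
rng = np.random.default_rng(0)
volume = rng.normal(size=(32, 64, 64))
views = random_local_views(volume, crop_shape=(16, 32, 32), num_views=4, rng=rng)
print(views.shape)
```

In a full pipeline each view would pass through the shared visual encoder before any similarity is computed; here `mean_pairwise_cosine` simply accepts an `(M, D)` array of embeddings from whatever encoder is used.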
