MG-3D: Multi-Grained Knowledge-Enhanced 3D Medical Vision-Language Pre-training

Xuefeng Ni, Linshan Wu, Jiaxin Zhuang, Qiong Wang, Mingxiang Wu, Varut Vardhanabhuti, Lihai Zhang, Hanyu Gao, Hao Chen

TL;DR

MG-3D tackles data-scarce 3D medical vision-language tasks by pre-training on volume-report data and explicitly modeling intra-patient multi-grained semantics alongside inter-patient semantic correlations. It introduces cross-modal global alignment and complementary local reconstruction for intra-patient learning, and sentence-based similarity matching with disentangled aggregation for inter-patient learning, all under a scalable multi-task framework. The approach achieves state-of-the-art performance across nine clinical tasks and demonstrates robust transfer to external datasets and modalities, with clear benefits from larger data and model capacity. Overall, MG-3D advances 3D medical VLP toward scalable, generalizable representations that leverage rich radiology reports for improved clinical decision support.

Abstract

3D medical image analysis is pivotal in numerous clinical applications, but the scarcity of labeled data and limited generalization capabilities hinder the advancement of AI-empowered models. Radiology reports are easily accessible and can serve as weakly-supervised signals, yet large-scale vision-language pre-training (VLP) remains underexplored in 3D medical image analysis. Specifically, multi-grained radiology semantics and their correlations across patients have been insufficiently investigated, which leads to underutilization of large-scale volume-report data. Considering intra-patient cross-modal semantic consistency and inter-patient semantic correlations, we propose a multi-task VLP method, MG-3D, pre-trained on large-scale data (47.1K), which addresses these challenges in two ways: 1) establishing the correspondence between volume semantics and multi-grained medical knowledge of each patient with cross-modal global alignment and complementary modality-guided local reconstruction, so that intra-patient features of different modalities cohesively represent the same semantic content; 2) correlating inter-patient visual semantics based on fine-grained report correlations across patients, while keeping sensitivity to global individual differences via contrastive learning, which enhances the discriminative feature representation. Furthermore, we investigate the scaling law to explore potential performance improvements. Comprehensive evaluations across nine uni- and cross-modal clinical tasks are carried out to assess model efficacy. Extensive experiments on both internal and external datasets demonstrate the superior transferability, scalability, and generalization of MG-3D, showcasing its potential in advancing feature representation for 3D medical image analysis. Code will be available at https://github.com/Xuefeng-Ni/MG-3D.
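
One common way to realize cross-modal global alignment is a contrastive objective over paired volume and report embeddings. As a rough illustration only, the sketch below computes a symmetric InfoNCE loss in pure Python. The function names, the temperature of 0.07, and the choice of InfoNCE are all assumptions made for this example; the abstract does not specify the exact loss, and this is not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(volume_feats, report_feats, temperature=0.07):
    """Hypothetical sketch: symmetric InfoNCE over N paired embeddings.

    volume_feats[i] and report_feats[i] are assumed to come from the same
    patient; every other pairing in the batch is treated as a negative.
    """
    v = [l2_normalize(x) for x in volume_feats]
    t = [l2_normalize(x) for x in report_feats]
    n = len(v)
    # logits[i][j]: scaled cosine similarity between volume i and report j
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Volume-to-report direction: the matching report sits on the diagonal.
    v2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Report-to-volume direction: same computation on the transposed matrix.
    transposed = [[logits[j][i] for j in range(n)] for i in range(n)]
    t2v = -sum(log_softmax(transposed[i])[i] for i in range(n)) / n
    return 0.5 * (v2t + t2v)


if __name__ == "__main__":
    volumes = [[1.0, 0.0], [0.0, 1.0]]
    matched_reports = [[1.0, 0.0], [0.0, 1.0]]
    shuffled_reports = [[0.0, 1.0], [1.0, 0.0]]
    print(symmetric_info_nce(volumes, matched_reports))
    print(symmetric_info_nce(volumes, shuffled_reports))
```

With these toy inputs, the loss for correctly paired embeddings is far smaller than the loss for shuffled pairs, which is the behavior a global alignment objective is meant to reward.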

Paper Structure

This paper contains 43 sections, 15 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Overview of the 3D medical VLP framework. After collecting large-scale 3D volume-report data from diverse patient groups, the 3D vision encoder can learn radiology knowledge from reports by aligning semantic features across modalities. This framework can advance various clinical tasks, such as diagnosis, treatment, prognosis, and beyond.
  • Figure 2: Overview of the proposed framework: (a) The left section illustrates the intra-patient multi-grained semantics extraction, consisting of cross-modal global feature alignment (CML) and complementary modality-guided local information reconstruction (MIM, MLM, and SFR). (b) The right section depicts the inter-patient multi-grained semantics alignment, generating sentence-informed global visual features for different patients, and aligning these fine-grained features (SSM) and their aggregated global features across patients (DFA) via contrastive learning.
  • Figure 3: Complementary Modality-Guided Local Information Reconstruction: (a) With the guidance of sentence semantics, masked volumes are complemented via MIM. (b) Likewise, masked report words are reconstructed in MLM with visual assistance. (c) In SFR, the reconstructed word features are aggregated at the sentence level to ensure alignment with the original sentence features.
  • Figure 4: Cross-Modal Attention: (a) Classical mechanism: MIM is mainly dominated by text features; (b) The proposed cross-modal attention: MIM is primarily dominated by volume features.
  • Figure 5: Intra-Patient Cross-Modal Learning: The global complementary modality-informed features are aligned with uni-modal global features to infuse cross-modal knowledge into the 3D vision encoder.
  • ...and 3 more figures
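
The caption of Figure 3(c) describes aggregating reconstructed word features at the sentence level and aligning them with the original sentence features. A minimal sketch of that idea follows, assuming mean pooling for the aggregation and a cosine-distance loss for the alignment. Neither choice is stated in the caption, and all function and parameter names are hypothetical.

```python
import math


def mean_pool_by_sentence(word_feats, sentence_ids, num_sentences):
    """Average word features that share a sentence id (assumed aggregation)."""
    dim = len(word_feats[0])
    sums = [[0.0] * dim for _ in range(num_sentences)]
    counts = [0] * num_sentences
    for feat, sid in zip(word_feats, sentence_ids):
        counts[sid] += 1
        for d in range(dim):
            sums[sid][d] += feat[d]
    # Guard against empty sentences to avoid division by zero.
    return [[s / max(c, 1) for s in row] for row, c in zip(sums, counts)]


def cosine(a, b):
    """Cosine similarity with a small epsilon for numerical safety."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b + 1e-12)


def sentence_alignment_loss(pooled_feats, target_feats):
    """Mean cosine distance between pooled and original sentence features."""
    distances = [1.0 - cosine(p, t) for p, t in zip(pooled_feats, target_feats)]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    # Three reconstructed word features: two from sentence 0, one from sentence 1.
    words = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
    ids = [0, 0, 1]
    pooled = mean_pool_by_sentence(words, ids, num_sentences=2)
    original_sentences = [[1.0, 0.0], [0.0, 1.0]]
    print(pooled)
    print(sentence_alignment_loss(pooled, original_sentences))
```

Because cosine distance ignores magnitude, the pooled features in this toy example align with the original sentence features even though their norms differ.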