Bridged Semantic Alignment for Zero-shot 3D Medical Image Diagnosis

Haoran Lai, Zihang Jiang, Qingsong Yao, Rongsheng Wang, Zhiyang He, Xiaodong Tao, Weifu Lv, Wei Wei, S. Kevin Zhou

TL;DR

BrgSA tackles zero-shot abnormality diagnosis in 3D CT by bridging the modality gap between image and text representations. It combines semantic summarization of radiology reports with a Cross-Modal Knowledge Bank (CMKB) that supports Cross-Modal Knowledge Interaction (CMKI), enabling implicit and explicit alignment through reconstruction and InfoNCE losses: $\mathcal{L}_{total} = \alpha \mathcal{L}_{MSE} + \beta \mathcal{L}_{INFO} + \gamma \mathcal{L}_{INFO-R}$. The method achieves state-of-the-art results on internal and external benchmarks, including the CT-RATE-LT long-tail dataset, and demonstrates strong retrieval performance and efficiency in 3D CT contexts. These contributions offer practical implications for diagnosing rare abnormalities and open-set generalization in clinical imaging workflows. BrgSA’s combination of semantic summarization and CMKI provides a scalable framework for robust vision-language alignment in 3D medical imaging.
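The weighted objective above can be illustrated with a minimal sketch in plain Python. Several things here are assumptions rather than details given in this summary: the symmetric (image-to-text plus text-to-image) form of InfoNCE, the temperature of 0.07, the default weights of 1.0, and all function names. The summary also does not define which embedding pairs feed $\mathcal{L}_{INFO-R}$, so the sketch simply treats it as a second InfoNCE value passed in by the caller.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [a / norm for a in v]


def info_nce(queries, keys, temperature=0.07):
    """Symmetric InfoNCE over a batch where queries[i] is paired with keys[i].

    The symmetric form and the temperature value are assumptions.
    """
    q = [_normalize(v) for v in queries]
    k = [_normalize(v) for v in keys]
    n = len(q)
    # Cosine-similarity logits scaled by the temperature.
    logits = [[_dot(q[i], k[j]) / temperature for j in range(n)] for i in range(n)]

    def cross_entropy_on_diagonal(rows):
        # Cross-entropy where the correct class for row i is column i,
        # using the log-sum-exp trick for numerical stability.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(c) for c in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(columns))


def mse(predicted, target):
    """Mean squared error over all elements of two equally shaped batches."""
    sq = [(x - y) ** 2 for u, v in zip(predicted, target) for x, y in zip(u, v)]
    return sum(sq) / len(sq)


def total_loss(l_mse, l_info, l_info_r, alpha=1.0, beta=1.0, gamma=1.0):
    """L_total = alpha * L_MSE + beta * L_INFO + gamma * L_INFO-R (weights are placeholders)."""
    return alpha * l_mse + beta * l_info + gamma * l_info_r


if __name__ == "__main__":
    image_feats = [[1.0, 0.0], [0.0, 1.0]]
    text_feats = [[0.9, 0.1], [0.1, 0.9]]
    l_info = info_nce(image_feats, text_feats)
    l_mse = mse(image_feats, text_feats)
    print(total_loss(l_mse, l_info, l_info_r=l_info))
```

In this toy setup, correctly paired embeddings yield a small InfoNCE value, while shuffling the pairs makes it large, which is the behavior the alignment terms rely on.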

Abstract

3D medical images such as computed tomography are widely used in clinical practice, offering great potential for automatic diagnosis. Supervised learning-based approaches have achieved significant progress but rely heavily on extensive manual annotations, and are therefore limited by the availability of training data and the diversity of abnormality types. Vision-language alignment (VLA) offers a promising alternative by enabling zero-shot learning without additional annotations. However, we empirically discover that the visual and textual embeddings produced by existing VLA methods still form two well-separated clusters after alignment, leaving a wide gap to be bridged. To bridge this gap, we propose a Bridged Semantic Alignment (BrgSA) framework. First, we utilize a large language model to perform semantic summarization of reports, extracting high-level semantic information. Second, we design a Cross-Modal Knowledge Interaction module that leverages a cross-modal knowledge bank as a semantic bridge, facilitating interaction between the two modalities, narrowing the gap, and improving their alignment. To comprehensively evaluate our method, we construct a benchmark dataset that includes 15 underrepresented abnormalities and also use two existing benchmark datasets. Experimental results demonstrate that BrgSA achieves state-of-the-art performance on both public benchmark datasets and our custom-labeled dataset, with significant improvements in zero-shot diagnosis of underrepresented abnormalities.
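The "two well-separated clusters" observation can be made concrete with a small sketch. The measure below (Euclidean distance between the centroids of L2-normalized image and text embeddings) is one common way to quantify a modality gap and is an assumption here, not necessarily the metric used by the authors; the function names and toy vectors are hypothetical. The mean cross-modal cosine similarity is included because cosine similarity is the alignment measure mentioned for the feature visualization.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def _centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def modality_gap(image_embs, text_embs):
    """Distance between the centroids of the normalized image and text embeddings."""
    c_img = _centroid([_normalize(v) for v in image_embs])
    c_txt = _centroid([_normalize(v) for v in text_embs])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_img, c_txt)))


def mean_cross_cosine(image_embs, text_embs):
    """Average cosine similarity over every image-text pair."""
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    sims = [sum(a * b for a, b in zip(i, t)) for i in imgs for t in txts]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    # Toy 2D embeddings: the two modalities sit on opposite sides of the origin.
    images = [[1.0, 0.1], [1.0, -0.1]]
    texts = [[-1.0, 0.1], [-1.0, -0.1]]
    print("gap:", modality_gap(images, texts))
    print("mean cosine:", mean_cross_cosine(images, texts))
```

A large centroid distance together with a low mean cosine similarity indicates that the two modalities occupy separate regions of the embedding space, which is the situation a bridging mechanism aims to reduce.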

Paper Structure (33 sections, 15 equations, 12 figures, 11 tables, 1 algorithm)

Figures (12)

  • Figure 1: UMAP visualization of features. Cosine similarity is used to evaluate the alignment between image and text features. The text features are generated using generic descriptive texts to ensure that all images can be matched to all texts. (a) Features generated using pretrained weights without vision-language alignment, where image features (in blue) and text features (in red) remain unaligned. The corresponding pre-training protocols are detailed in the paper (section label sec:ID). (b) Features after vision-language alignment using 3D CLIP, showing improved alignment but with a noticeable modality gap. (c) Features after vision-language alignment using the BrgSA framework, where the CMKB features (in green) serve as a bridge to reduce the modality gap and further enhance feature alignment. CMKB denotes the Cross-Modal Knowledge Bank, and BrgSA abbreviates Bridged Semantic Alignment.
  • Figure 2: Illustration of the proposed BrgSA network, which integrates semantic summarization and cross-modal knowledge interaction (CMKI). First, we leverage a large language model (LLM) to summarize the report, generating outputs in a fixed template. These summarized reports, along with the original reports, serve as the textual inputs. Then, image and text features are extracted by their respective encoders and fed into the CMKI module to obtain interaction features. Finally, the interaction features are constrained using an MSE loss, while alignment optimization is achieved via an InfoNCE loss.
  • Figure 3: Prompt given to the LLM for semantic summarization of reports.
  • Figure 4: Histogram of abnormality frequencies for CT-RATE-LT.
  • Figure 5: Mapping of 27 abnormalities from RAD-ChestCT to 18 abnormalities in CT-RATE. The abnormalities from CT-RATE are denoted in blue font, whereas the abnormalities from RAD-ChestCT are denoted in black font.
  • ...and 7 more figures