Bridged Semantic Alignment for Zero-shot 3D Medical Image Diagnosis
Haoran Lai, Zihang Jiang, Qingsong Yao, Rongsheng Wang, Zhiyang He, Xiaodong Tao, Weifu Lv, Wei Wei, S. Kevin Zhou
TL;DR
BrgSA tackles zero-shot abnormality diagnosis in 3D CT by bridging the modality gap between image and text representations. It combines semantic summarization of radiology reports with a Cross-Modal Knowledge Bank (CMKB) that supports Cross-Modal Knowledge Interaction (CMKI), aligning the two modalities both implicitly and explicitly through reconstruction and InfoNCE losses: $\mathcal{L}_{total} = \alpha \mathcal{L}_{MSE} + \beta \mathcal{L}_{INFO} + \gamma \mathcal{L}_{\text{INFO-R}}$. The method achieves state-of-the-art results on internal and external benchmarks, including the long-tail CT-RATE-LT dataset, and also shows strong retrieval performance and efficiency on 3D CT. These results are relevant to diagnosing rare abnormalities and to open-set generalization in clinical imaging workflows, and the combination of semantic summarization and CMKI offers a scalable framework for vision-language alignment in 3D medical imaging.
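To make the structure of the combined objective concrete, the following is a minimal NumPy sketch of a weighted sum of a reconstruction (MSE) term and two InfoNCE terms. It is an illustration under stated assumptions, not the authors' implementation: the summary does not define $\mathcal{L}_{\text{INFO-R}}$, so the sketch assumes it is an InfoNCE loss computed on reconstructed embeddings. The function names, the temperature of 0.07, and the default weights are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def info_nce(img, txt, temperature=0.07):
    """Symmetric InfoNCE over a batch where row i of img pairs with row i of txt."""
    img, txt = l2_normalize(img), l2_normalize(txt)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix

    def cross_entropy_on_diagonal(l):
        # Numerically stable log-softmax; matched pairs sit on the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


def total_loss(img, txt, img_rec, txt_rec, alpha=1.0, beta=1.0, gamma=1.0):
    """alpha * L_MSE + beta * L_INFO + gamma * L_INFO-R.

    ASSUMPTION: img_rec / txt_rec are reconstructed embeddings (for example,
    produced from a knowledge bank), and L_INFO-R is InfoNCE applied to them.
    The source gives only the weighted-sum form, not these definitions.
    """
    l_mse = np.mean((img_rec - img) ** 2) + np.mean((txt_rec - txt) ** 2)
    l_info = info_nce(img, txt)
    l_info_r = info_nce(img_rec, txt_rec)
    return alpha * l_mse + beta * l_info + gamma * l_info_r


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(8, 16))
    txt = img + 0.1 * rng.normal(size=(8, 16))  # roughly aligned pairs
    print(total_loss(img, txt, img_rec=img, txt_rec=txt))
```

Setting `beta` or `gamma` to zero in this sketch isolates the contribution of each term, which is one way to reason about how the weights trade off reconstruction against contrastive alignment.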
Abstract
3D medical images such as computed tomography are widely used in clinical practice and offer great potential for automatic diagnosis. Supervised learning-based approaches have achieved significant progress but rely heavily on extensive manual annotations, and are therefore limited by the availability of training data and the diversity of abnormality types. Vision-language alignment (VLA) offers a promising alternative by enabling zero-shot learning without additional annotations. However, we empirically find that the visual and textual embeddings produced by existing VLA methods still form two well-separated clusters after alignment, leaving a wide gap to be bridged. To bridge this gap, we propose a Bridged Semantic Alignment (BrgSA) framework. First, we utilize a large language model to perform semantic summarization of reports, extracting high-level semantic information. Second, we design a Cross-Modal Knowledge Interaction module that leverages a cross-modal knowledge bank as a semantic bridge, facilitating interaction between the two modalities, narrowing the gap, and improving their alignment. To comprehensively evaluate our method, we construct a benchmark dataset that includes 15 underrepresented abnormalities and also use two existing benchmark datasets. Experimental results demonstrate that BrgSA achieves state-of-the-art performance on both the public benchmark datasets and our custom-labeled dataset, with significant improvements in zero-shot diagnosis of underrepresented abnormalities.
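The observation that image and text embeddings fall into two separate clusters can be quantified in a simple way. The sketch below uses one common measure, the Euclidean distance between the centroids of the L2-normalized embeddings of each modality. This is an illustrative choice and not necessarily the measure used in the paper; the function name `modality_gap` is hypothetical.

```python
import numpy as np


def modality_gap(img_emb, txt_emb, eps=1e-12):
    """Distance between the centroids of L2-normalized image and text embeddings.

    ASSUMPTION: centroid distance is used here only as an illustrative gap
    measure; the source does not state how the gap was measured.
    """
    img = img_emb / (np.linalg.norm(img_emb, axis=1, keepdims=True) + eps)
    txt = txt_emb / (np.linalg.norm(txt_emb, axis=1, keepdims=True) + eps)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shared = rng.normal(size=(100, 32))
    offset = np.zeros(32)
    offset[0] = 5.0  # push one modality away along a single axis

    # Both modalities drawn around the same points: small gap.
    print(modality_gap(shared, shared + 0.05 * rng.normal(size=shared.shape)))
    # One modality shifted by a constant offset: noticeably larger gap.
    print(modality_gap(shared, shared + offset))
```

A value near zero means the two modalities occupy the same region of the embedding space, while a larger value corresponds to the well-separated clusters described above.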
