Anatomical Structure-Guided Medical Vision-Language Pre-training

Qingqiu Li; Xiaohan Yan; Jilan Xu; Runtian Yuan; Yuejie Zhang; Rui Feng; Quanli Shen; Xiaobo Zhang; Shujun Wang

TL;DR

The paper addresses gaps in interpretability and cross-modal learning in medical vision-language pre-training (VLP) by introducing the Anatomical Structure-Guided (ASG) framework, which parses radiology reports into triplets $\langle \textit{anatomical region}, \textit{finding}, \textit{existence} \rangle$ and jointly optimizes global and local cross-modal alignment. It comprises three components: Image-Report Alignment (IRA) for global matching; Anatomical Region-Sentence Alignment (ARSA) for fine-grained region-sentence correspondences, designed in collaboration with radiologists; and Internal and External Representation Learning (IERL), which uses an image-tag recognition decoder and soft-label contrastive learning to better connect images and reports. The model is pre-trained on MIMIC-CXR with $\approx 2.17\times 10^5$ image-report pairs and evaluated on five public benchmarks, reporting state-of-the-art or competitive results on both classification and segmentation tasks, with notable gains on COVID-19 data and improved localization with ARSA. Overall, ASG advances clinically relevant, interpretable VLP by combining anatomical structure, region-sentence alignment, and semantic associations across different image-report pairs, with potential for broader radiology applications.
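To make the triplet representation concrete, here is a minimal Python sketch of how parsed triplets might be stored and turned into image tags. The `Triplet` class, the `tags_from_triplets` helper, the example vocabulary, and the choice to map negated findings to 0 are all illustrative assumptions; the summary does not specify how the paper's parser or tag encoding works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """Hypothetical container for one parsed report statement."""
    region: str    # anatomical region, e.g. "left lung"
    finding: str   # reported finding, e.g. "pleural effusion"
    exists: bool   # True if the report asserts the finding is present


def tags_from_triplets(triplets, vocabulary):
    """Build a multi-hot tag vector over a fixed finding vocabulary.

    Assumption: a finding counts as a positive tag only when at least
    one triplet asserts that it exists; negated findings map to 0.
    """
    present = {t.finding for t in triplets if t.exists}
    return [1.0 if finding in present else 0.0 for finding in vocabulary]


# Illustrative data, not taken from MIMIC-CXR.
vocabulary = ["atelectasis", "pleural effusion", "pneumothorax"]
report_triplets = [
    Triplet("left lung", "pleural effusion", True),
    Triplet("right lung", "pneumothorax", False),
]
tags = tags_from_triplets(report_triplets, vocabulary)
print(tags)  # [0.0, 1.0, 0.0]
```

In this sketch the region element is kept alongside each finding so that it could drive region-sentence alignment, while the finding and existence elements collapse into a per-image tag vector.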

Abstract

Learning medical visual representations through vision-language pre-training has made remarkable progress. Despite the promising performance, it still faces two challenges: local alignment lacks interpretability and clinical relevance, and the internal and external representation learning of image-report pairs is insufficient. To address these issues, we propose an Anatomical Structure-Guided (ASG) framework. Specifically, we parse raw reports into triplets <anatomical region, finding, existence>, and fully utilize each element as supervision to enhance representation learning. For anatomical region, we design an automatic anatomical region-sentence alignment paradigm in collaboration with radiologists, treating anatomical regions and sentences as the minimum semantic units for exploring fine-grained local alignment. For finding and existence, we regard them as image tags, applying an image-tag recognition decoder to associate image features with their respective tags within each sample and constructing soft labels for contrastive learning to improve the semantic association of different image-report pairs. We evaluate the proposed ASG framework on two downstream tasks across five public benchmarks. Experimental results demonstrate that our method outperforms state-of-the-art methods.
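The soft-label idea can be illustrated with a short, dependency-free Python sketch. It is not the paper's formulation: the use of cosine similarity between tag vectors as the soft target, the row normalization, the image-to-text direction only, and the temperature value are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def soft_targets(tag_vectors):
    """Assumed soft labels: tag similarity between samples, row-normalized.

    The diagonal is fixed at 1 so each sample always matches itself.
    Tag vectors are assumed non-negative (multi-hot).
    """
    n = len(tag_vectors)
    targets = []
    for i in range(n):
        row = [1.0 if i == j else cosine(tag_vectors[i], tag_vectors[j])
               for j in range(n)]
        total = sum(row)
        targets.append([x / total for x in row])
    return targets


def soft_contrastive_loss(image_embs, text_embs, tag_vectors, temperature=0.07):
    """Image-to-text cross-entropy against soft targets (one direction only)."""
    targets = soft_targets(tag_vectors)
    loss = 0.0
    for i, image in enumerate(image_embs):
        logits = [cosine(image, text) / temperature for text in text_embs]
        probs = softmax(logits)
        loss -= sum(t * math.log(p) for t, p in zip(targets[i], probs) if t > 0)
    return loss / len(image_embs)


# Toy batch: samples 0 and 1 share the same tags, sample 2 does not.
tag_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
image_embs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]
text_embs = [[1.0, 0.0], [0.8, 0.3], [0.0, 1.0]]
print(soft_targets(tag_vectors)[0])  # [0.5, 0.5, 0.0]
print(soft_contrastive_loss(image_embs, text_embs, tag_vectors))
```

With one-hot targets, sample 1's report would be pushed away from sample 0's image even though both carry the same tags. Under the soft targets in this sketch, the two share the target mass, which is one way to read the stated goal of improving the semantic association of different image-report pairs.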

Paper Structure (9 sections, 7 equations, 4 figures, 3 tables)

Introduction
Methodology
  Image-Report Alignment (IRA)
  Anatomical Region-Sentence Alignment (ARSA)
  Internal and External Representation Learning (IERL)
Experiments
  Experimental Setting
  Experimental Results
Conclusion

Figures (4)

Figure 1: Two limitations of existing methods: (a) lack of interpretability and clinical relevance and (b) insufficient representation learning of image-report pairs; and our corresponding improvement.
Figure 2: Overview of our ASG framework. (a) Exploring local alignment between image-text pairs through anatomical region-sentence alignment. (b) Optimizing internal representation learning by applying an image-tag recognition decoder to associate image features with their respective tags. (c) Optimizing external representation learning by constructing soft labels for contrastive learning to mitigate false negatives.
Figure 2: Comparison with other SOTA methods on segmentation task.
Figure 3: Heat maps of the vision-language association learned by ASG, compared with GT annotations provided by radiologists.
