Grounded Knowledge-Enhanced Medical Vision-Language Pre-training for Chest X-Ray

Qiao Deng, Zhongzhen Huang, Yunqi Wang, Zhichuan Wang, Zhao Wang, Xiaofan Zhang, Qi Dou, Yeung Yu Hui, Edward S. Hui

TL;DR

This paper addresses robust cross-modal learning for chest X-rays by proposing GK-MVLP, a grounded knowledge-enhanced medical vision-language pre-training framework. It grounds entity-level medical knowledge to anatomical regions and fuses region-aware prompts with global visual features through a transformer-based grounded knowledge-enhanced (GK) module, trained with four pre-training objectives: image-text contrastive (ITC), image-text matching (ITM), language modeling (LM), and entity classification (ECLS). GK-MVLP achieves state-of-the-art or competitive performance on downstream tasks including disease classification, disease localization, radiology report generation, and medical VQA, while using a relatively small pre-training corpus. The results indicate that explicit grounding and region-entity alignment can reduce bias and improve fine-grained understanding, which makes the framework a candidate foundation for multi-modal medical AI applications and for extension to additional imaging modalities and language models.
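
Of the four objectives, ITC is the most widely reused, so a small illustration may help. The sketch below is a generic symmetric contrastive (InfoNCE-style) loss as commonly used in vision-language pre-training; it is not taken from the paper. The function names, the temperature value of 0.07, and the toy embeddings are all assumptions made for illustration.

```python
import math

def l2_normalize(v):
    # Assumes v is not the zero vector.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def itc_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; image i and text i form the positive pair."""
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)
    # sim[i][j]: cosine similarity of image i and text j, divided by temperature.
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Image-to-text direction: softmax over texts for each image.
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-image direction: softmax over images for each text.
    sim_t = [list(col) for col in zip(*sim)]
    t2i = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (i2t + t2i)

# Hypothetical toy batch: correctly paired vs. deliberately swapped embeddings.
matched = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With correctly paired embeddings the loss is close to zero, and it grows when the pairs are swapped. That is the behaviour a contrastive objective is meant to reward.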

Abstract

Medical foundation models have the potential to revolutionize healthcare by providing robust and generalized representations of medical data. Medical vision-language pre-training has emerged as a promising approach for learning domain-general representations of medical images and text. Current algorithms that exploit global and local alignment between medical images and text could, however, be marred by redundant information in medical data. To address this issue, we propose a grounded knowledge-enhanced medical vision-language pre-training (GK-MVLP) framework for chest X-ray. In this framework, medical knowledge is grounded to the appropriate anatomical regions by a transformer-based grounded knowledge-enhanced module, which performs fine-grained alignment between the textual features of medical knowledge and the corresponding anatomical region-level visual features. The performance of GK-MVLP was competitive with or exceeded the state of the art on downstream image understanding tasks (chest X-ray disease classification and disease localization), a generative task (report generation), and a vision-language understanding task (medical visual question answering). Our results demonstrate the advantage of incorporating a grounding mechanism to remove biases and improve the alignment between chest X-ray images and radiology reports.

Paper Structure

This paper contains 21 sections, 13 equations, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: Comparison of different alignment methods for medical vision-language pre-training (VLP): previous models employ (a) global alignment by associating overall visual features with textual features, and (b) local alignment by connecting image patches with corresponding word features. (c) Our GK-MVLP aligns the global-local visual features (anatomical-region level) with medical knowledge features.
  • Figure 2: Illustration of (a) the pre-processing of medical knowledge prompts, (b) the architecture of the grounded knowledge-enhanced (GK) module, where $f_{R}$ is a projection head, and (c) the grounded knowledge-enhanced medical vision-language pre-training (GK-MVLP) framework.
  • Figure 3: Illustration of the (a) grounding mechanism and the pipeline of the (b) entity classification loss.
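
The captions describe a transformer-based module that aligns medical knowledge features with anatomical region-level visual features. The sketch below shows one generic way to express this kind of fusion, using single-head scaled dot-product cross-attention. It is an assumption-laden illustration, not the paper's GK module: the choice of knowledge prompts as queries and region features as keys and values, the omission of learned projections such as $f_{R}$, and all names and toy values are hypothetical.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    d = len(keys[0])
    weights = [
        softmax([sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(d)
                 for k in keys])
        for q in queries
    ]
    # Each fused vector is the attention-weighted average of the values.
    fused = [
        [sum(w * v[c] for w, v in zip(w_row, values))
         for c in range(len(values[0]))]
        for w_row in weights
    ]
    return fused, weights

# Hypothetical toy inputs: 2 knowledge-prompt embeddings, 3 region features.
prompt_emb = [[4.0, 0.0], [0.0, 4.0]]
region_feat = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fused, attn = cross_attention(prompt_emb, region_feat, region_feat)
```

Each row of `attn` is a distribution over regions for one knowledge prompt, so the largest weight marks the region that the prompt is most strongly grounded to in this toy setting. A real module would add learned query, key, and value projections and multiple heads.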