Grounded Knowledge-Enhanced Medical Vision-Language Pre-training for Chest X-Ray
Qiao Deng, Zhongzhen Huang, Yunqi Wang, Zhichuan Wang, Zhao Wang, Xiaofan Zhang, Qi Dou, Yeung Yu Hui, Edward S. Hui
TL;DR
This paper tackles the challenge of robust cross-modal learning for chest X-rays by proposing GK-MVLP, a grounded knowledge-enhanced medical vision-language pre-training framework. It grounds entity-level medical knowledge to anatomical regions and fuses region-aware prompts with global visual features through a transformer-based GK module, paired with four pre-training objectives (ITC, ITM, LM, ECLS). GK-MVLP achieves state-of-the-art or competitive performance on downstream tasks including disease classification, disease localization, radiology report generation, and medical VQA, while using a relatively small pre-training corpus. The results indicate that explicit grounding and region-entity alignment can reduce bias and improve fine-grained understanding, offering a robust foundation for multi-modal medical AI applications and a potential path to additional imaging modalities and language models.
Abstract
Medical foundation models have the potential to revolutionize healthcare by providing robust and generalized representations of medical data. Medical vision-language pre-training has emerged as a promising approach for learning domain-general representations of medical images and text. However, current algorithms that exploit global and local alignment between medical images and text can be marred by redundant information in medical data. To address this issue, we propose a grounded knowledge-enhanced medical vision-language pre-training (GK-MVLP) framework for chest X-ray. In this framework, medical knowledge is grounded to the appropriate anatomical regions by a transformer-based grounded knowledge-enhanced module, which performs fine-grained alignment between the textual features of medical knowledge and the corresponding anatomical region-level visual features. The performance of GK-MVLP is competitive with or exceeds the state of the art on downstream image understanding tasks (chest X-ray disease classification and disease localization), a generative task (report generation), and a vision-language understanding task (medical visual question answering). Our results demonstrate the advantage of incorporating a grounding mechanism to remove biases and improve the alignment between chest X-ray images and radiology reports.
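As a rough illustration of the fine-grained alignment described above, the sketch below applies one step of scaled dot-product cross-attention, with knowledge (entity) text features as queries and anatomical region-level visual features as keys and values. It is a minimal sketch under stated assumptions, not the GK-MVLP module itself: the function name cross_attend, the feature dimensions, and the random inputs are all hypothetical, and the learned projections, multiple heads, and stacked layers of a real transformer are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(entity_feats, region_feats):
    """Single-head cross-attention without learned projections (assumption).

    entity_feats: (n_entities, d) text features of medical knowledge entities.
    region_feats: (n_regions, d) visual features of anatomical regions.
    Returns the region-grounded entity features and the attention weights.
    """
    d = entity_feats.shape[-1]
    # Similarity of every entity to every region, scaled by sqrt(d).
    scores = entity_feats @ region_feats.T / np.sqrt(d)
    # Each entity gets a probability distribution over regions.
    weights = softmax(scores, axis=-1)
    # Each entity becomes a weighted mixture of region features.
    grounded = weights @ region_feats
    return grounded, weights


# Hypothetical toy inputs: 3 entities, 5 anatomical regions, 16-dim features.
rng = np.random.default_rng(0)
entity_feats = rng.normal(size=(3, 16))
region_feats = rng.normal(size=(5, 16))

grounded, weights = cross_attend(entity_feats, region_feats)
print(grounded.shape, weights.shape)
```

Each row of weights shows how strongly one entity attends to each region, which is one simple way to inspect whether a finding is associated with a plausible anatomical location.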
