Fusion of Domain-Adapted Vision and Language Models for Medical Visual Question Answering
Cuong Nhat Ha, Shima Asaadi, Sanjeev Kumar Karn, Oladimeji Farri, Tobias Heimann, Thomas Runkler
TL;DR
General-purpose vision-language models (VLMs) lose accuracy when applied to Medical Visual Question Answering (MedVQA) in domain-specific clinical scenarios. The authors propose a domain-adapted VLM that fuses a radiology-domain LLM (RadBloomz-7b) with biomedical vision encoders (e.g., BiomedCLIP-ViT, PMC-CLIP) through a learnable fusion module. The model is trained in three LoRA-based stages with the objective $L(\Theta) = - \sum_{t=1}^{T} \log p(a_t \mid v, q, a_{1:t-1}; \Theta)$. It reaches state-of-the-art accuracy of 87.5% on the English subset of SLAKE 1.0 and 73.2% on VQA-RAD. Ablations indicate an improvement of roughly 25% from the full pretraining pipeline, and clear gains from using a radiology-domain LLM instead of general-domain baselines. The work shows that pairing a domain-specific LLM with medical vision encoders can yield substantial gains in radiology VQA while keeping training parameter-efficient, which offers a practical path toward domain-specialized multimodal medical AI.
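The objective quoted in the TL;DR is a standard autoregressive negative log-likelihood over the answer tokens. The sketch below illustrates that formula only, not the authors' training code. It assumes $v$ is the image, $q$ the question, and $a_t$ the $t$-th answer token; this reading of the symbols and the names `log_softmax` and `answer_nll` are assumptions made for illustration. The per-step logits stand in for whatever the model produces after conditioning on $v$, $q$, and the earlier answer tokens.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def answer_nll(step_logits, answer_ids):
    """Sum of -log p(a_t | v, q, a_{1:t-1}) over the T answer tokens.

    step_logits[t] holds the model's logits at step t (hypothetical input:
    it is assumed to already reflect the image, question, and prior tokens).
    answer_ids[t] is the index of the reference answer token at step t.
    """
    if len(step_logits) != len(answer_ids):
        raise ValueError("need exactly one logit vector per answer token")
    total = 0.0
    for logits, token_id in zip(step_logits, answer_ids):
        total -= log_softmax(logits)[token_id]
    return total


# Uniform logits over a 4-token vocabulary for a 3-token answer.
uniform_loss = answer_nll([[0.0] * 4] * 3, [0, 1, 2])

# Logits that favour the reference tokens give a lower loss.
confident_loss = answer_nll(
    [[5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0]],
    [0, 1, 2],
)
```

With uniform logits every token has probability 1/4, so the loss equals $3 \log 4$; minimizing $L(\Theta)$ pushes probability mass toward the reference answer tokens.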
Abstract
Vision-language models are effective in general domains and perform strongly in diverse multimodal applications such as visual question answering (VQA), but they struggle to maintain the same level of effectiveness in more specialized domains such as medicine. We propose a medical vision-language model that integrates large vision and language models adapted for the medical domain. The model goes through three stages of parameter-efficient training, using three separate biomedical and radiology multimodal image-text datasets. The proposed model achieves state-of-the-art performance on the SLAKE 1.0 medical VQA (MedVQA) dataset with an overall accuracy of 87.5% and demonstrates strong performance on another MedVQA dataset, VQA-RAD, with an overall accuracy of 73.2%.
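The parameter-efficient training mentioned above is described in the TL;DR as LoRA-based. The sketch below shows the generic LoRA idea only: a frozen weight matrix $W$ plus a trainable low-rank update scaled by $\alpha / r$. It is not the authors' configuration; the class name `LoRALinear`, the rank, the scaling factor, and the initialization constants are all hypothetical choices for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Linear layer y = W x + (alpha / r) * B (A x) with W kept frozen."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight            # frozen pretrained weight, d_out x d_in
        self.scale = alpha / rank       # LoRA scaling factor
        # A gets a small random init, B starts at zero, so the adapted layer
        # initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_trainable(self):
        """Only A and B would be updated during training."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# A 4x4 identity matrix stands in for a pretrained weight.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(identity, rank=2, alpha=4.0)
output = layer.forward([1.0, 2.0, 3.0, 4.0])
trainable = layer.num_trainable()
```

Because `B` starts at zero, `output` equals the frozen layer's output before any training. The trainable parameter count is $r(d_{in} + d_{out})$ rather than $d_{in} d_{out}$, which is where the parameter savings come from when $r$ is much smaller than the layer dimensions.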
