Fusion of Domain-Adapted Vision and Language Models for Medical Visual Question Answering

Cuong Nhat Ha, Shima Asaadi, Sanjeev Kumar Karn, Oladimeji Farri, Tobias Heimann, Thomas Runkler

TL;DR

General vision-language models (VLMs) lose effectiveness when applied to domain-specific clinical tasks such as Medical Visual Question Answering (MedVQA). The authors propose a domain-adapted VLM that fuses a radiology-domain LLM (RadBloomz-7b) with biomedical vision encoders (e.g., BiomedCLIP-ViT or PMC-CLIP) through a learnable fusion module, trained in three LoRA-based stages. Training minimizes the autoregressive negative log-likelihood of the answer tokens, $L(\Theta) = -\sum_{t=1}^{T} \log p(a_t \mid v, q, a_{1:t-1}; \Theta)$, where $v$ is the image, $q$ the question, $a_t$ the $t$-th answer token, and $\Theta$ the trainable parameters. The model reaches a state-of-the-art overall accuracy of 87.5% on the English subset of SLAKE 1.0 and a strong 73.2% on VQA-RAD. Ablations indicate a ~25% improvement from the full pretraining pipeline and clear gains from the radiology-domain LLM over general-domain baselines. The work shows that domain-specific LLMs and medical vision encoders can yield substantial gains in radiology VQA while keeping training parameter-efficient, offering a practical path toward domain-specialized multimodal medical AI.
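To make the training objective concrete, the following is a minimal sketch of the autoregressive negative log-likelihood in plain Python. It assumes the model has already produced, for each step $t$, a distribution over tokens conditioned on the image, the question, and the previous answer tokens. The function name `sequence_nll`, the dictionary representation, and the toy probabilities are illustrative assumptions, not the authors' code.

```python
import math


def sequence_nll(step_probs, answer_tokens):
    """Sum of -log p(a_t | v, q, a_{1:t-1}) over the answer tokens.

    step_probs[t] is a dict mapping token -> probability at step t.
    The conditioning on image v, question q, and earlier answer tokens
    is assumed to have happened inside the (hypothetical) model that
    produced these distributions.
    """
    if len(step_probs) != len(answer_tokens):
        raise ValueError("need exactly one distribution per answer token")
    nll = 0.0
    for probs, token in zip(step_probs, answer_tokens):
        nll -= math.log(probs[token])
    return nll


# Toy two-token answer with made-up probabilities (not real model output).
toy_probs = [
    {"pleural": 0.5, "no": 0.5},
    {"effusion": 0.25, "thickening": 0.75},
]
loss = sequence_nll(toy_probs, ["pleural", "effusion"])
# loss = -(log 0.5 + log 0.25) = log 8
```

In practice this quantity is usually computed as a token-level cross-entropy over logits and averaged over a batch; the sketch only shows the per-example sum that appears in the formula.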

Abstract

Vision-language models, while effective in general domains and showing strong performance in diverse multi-modal applications like visual question-answering (VQA), struggle to maintain the same level of effectiveness in more specialized domains, e.g., medical. We propose a medical vision-language model that integrates large vision and language models adapted for the medical domain. This model goes through three stages of parameter-efficient training using three separate biomedical and radiology multi-modal visual and text datasets. The proposed model achieves state-of-the-art performance on the SLAKE 1.0 medical VQA (MedVQA) dataset with an overall accuracy of 87.5% and demonstrates strong performance on another MedVQA dataset, VQA-RAD, achieving an overall accuracy of 73.2%.
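The parameter-efficient training mentioned above relies on LoRA, which freezes a pretrained weight matrix $W$ and learns a low-rank update so that the effective weight is $W + \frac{\alpha}{r} B A$. The sketch below illustrates that general idea in plain Python; the class name `LoRALinear`, the rank, the scaling factor, and the initialization scale are illustrative assumptions, since the summary does not state the paper's LoRA hyperparameters or which layers are adapted.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative)."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        self.weight = weight                      # frozen, shape (d_out, d_in)
        self.d_out, self.d_in = len(weight), len(weight[0])
        self.rank = rank
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the adapter is
        # initially a no-op and the layer reproduces the frozen weights.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(self.d_in)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(self.d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameters(self):
        # Only A (rank x d_in) and B (d_out x rank) would be trained.
        return self.rank * (self.d_in + self.d_out)


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(identity, rank=1, alpha=2.0)
output = layer.forward([1.0, 2.0, 3.0, 4.0])
trainable = layer.trainable_parameters()   # 8 trainable vs. 16 frozen weights
```

For a 4x4 layer the saving is small, but the trainable count grows as $r(d_{in} + d_{out})$ rather than $d_{in} d_{out}$, which is what makes the approach attractive for multi-billion-parameter models.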

Paper Structure (15 sections, 3 equations, 2 figures, 8 tables)

Figures (2)

  • Figure 1: Overview of the proposed vision-language model (VLM) architecture for the MedVQA task. The output of the biomedical-adapted vision encoder is combined with the input question and processed by a radiology-adapted large language model (LLM). Learned queries are initialized from scratch and trained during the proposed alignment training of the multi-modal domain-adapted models, which covers image-caption pretraining, synthetic biomedical MQA, and MedVQA datasets, all fine-tuned with the parameter-efficient LoRA technique.
  • Figure 2: Image examples from VQA-RAD corresponding to the questions in the qualitative-results table.