A Lightweight Large Vision-language Model for Multimodal Medical Images
Belal Alsinglawi, Chris McCarthy, Sara Webb, Christopher Fluke, Navid Toosy Saidy
TL;DR
This work introduces a lightweight multimodal VQA model for medical imaging that combines BiomedCLIP for image feature extraction with LLaMA-3 for text processing, targeting real-world clinical utility. Trained in two stages with LoRA fine-tuning and joint alignment, the model achieves 73.4% accuracy on open-ended OmniMedVQA questions with approximately 8B parameters and runs on two 40 GB NVIDIA A100 GPUs, offering a favorable balance of efficiency and performance. Open-ended question handling and strong cross-modal alignment enable robust clinical QA across diverse imaging modalities, though MRI remains challenging and repetitiveness in the dataset suggests a risk of overfitting. The study demonstrates that resource-efficient, open-ended medical VQA is feasible, paving the way for practical deployment and for further improvements in generalization and reasoning.
Abstract
Medical Visual Question Answering (VQA) enhances clinical decision-making by enabling systems to interpret medical images and answer clinical queries. However, developing efficient, high-performance VQA models is challenging due to the complexity of medical imagery and the diversity of imaging modalities. In this paper, we introduce a lightweight, multimodal VQA model that integrates BiomedCLIP for image feature extraction with LLaMA-3 for text processing. Designed for medical VQA tasks, our model achieves state-of-the-art performance on the OmniMedVQA dataset. With approximately 8 billion parameters, it requires only two NVIDIA A100 GPUs (40 GB each), demonstrating superior efficiency over larger models. Our results show 73.4% accuracy on open-ended questions, surpassing existing models and validating the model's potential for real-world medical applications. Key contributions include a specialized multimodal VQA model, a resource-efficient architecture, and strong performance in answering open-ended clinical questions.
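To make the resource claim concrete, the following is a rough back-of-the-envelope sketch, not the authors' measurement. It estimates the memory needed to hold roughly 8 billion weights at two common precisions, and illustrates why low-rank adaptation (LoRA, mentioned in the TL;DR) keeps the trainable parameter count small. The numeric precisions, the 4096 hidden size, and the rank of 16 are illustrative assumptions; the paper summary above does not state them. The estimate covers weights only and ignores activations, optimizer state, and the image encoder's runtime buffers.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in a LoRA update B @ A, with B: d_out x rank and A: rank x d_in."""
    return rank * (d_in + d_out)


NUM_PARAMS = 8e9          # "approximately 8 billion parameters" (from the abstract)
TOTAL_GPU_GB = 2 * 40     # two 40 GB A100 GPUs (from the abstract)

# Assumed precisions: 2 bytes (16-bit) and 4 bytes (32-bit) per parameter.
half_precision_gb = weight_memory_gb(NUM_PARAMS, 2)
full_precision_gb = weight_memory_gb(NUM_PARAMS, 4)

# Hypothetical single projection matrix: 4096 x 4096, adapted with rank 16.
HIDDEN, RANK = 4096, 16
full_matrix_params = HIDDEN * HIDDEN
lora_params = lora_trainable_params(HIDDEN, HIDDEN, RANK)
lora_fraction = lora_params / full_matrix_params

if __name__ == "__main__":
    print(f"16-bit weights: {half_precision_gb:.1f} GB of {TOTAL_GPU_GB} GB")
    print(f"32-bit weights: {full_precision_gb:.1f} GB of {TOTAL_GPU_GB} GB")
    print(f"LoRA params per matrix: {lora_params} ({lora_fraction:.2%} of full)")
```

Under these assumptions, the weights alone fit within the combined 80 GB at either precision, and the remaining headroom is what training with low-rank adapters would have to share among activations, gradients, and optimizer state.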
