A Lightweight Large Vision-language Model for Multimodal Medical Images

Belal Alsinglawi, Chris McCarthy, Sara Webb, Christopher Fluke, Navid Toosy Saidy

TL;DR

This work introduces a lightweight multimodal VQA model for medical imaging that fuses BiomedCLIP image features with LLaMA-3 for text processing, targeting real-world clinical utility. Trained in two stages with LoRA fine-tuning and joint alignment, the model achieves 73.4% accuracy on open-ended OmniMedVQA questions while using ~8B parameters and running on two 40 GB NVIDIA A100 GPUs, a favorable balance of efficiency and performance. Open-ended question handling and cross-modal alignment support clinical QA across diverse imaging modalities, though MRI remains challenging and the repetitiveness of the dataset suggests potential overfitting. The study indicates that resource-efficient, open-ended medical VQA is feasible, and it points to generalization and reasoning as the main areas for further work before practical deployment.
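To make the LoRA step concrete, the following is a minimal NumPy sketch of the standard low-rank update, in which a frozen weight matrix is augmented by a trainable product of two small matrices. It is not the authors' training code: the layer sizes, rank `r`, and scaling `alpha` are hypothetical values chosen for illustration, and the summary does not state which values or which layers the paper uses.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha, r):
    """Frozen projection x @ W.T plus a scaled low-rank correction.

    W is the frozen pretrained weight (d_out, d_in); only A (r, d_in)
    and B (d_out, r) would be trained.
    """
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 8, 16      # hypothetical sizes, not from the paper

W = rng.normal(size=(d_out, d_in))         # frozen weight
A = rng.normal(scale=0.01, size=(r, d_in)) # trainable, small random init
B = np.zeros((d_out, r))                   # trainable, zero init

x = rng.normal(size=(4, d_in))             # a batch of 4 input vectors
y = lora_forward(x, W, A, B, alpha, r)

trainable = A.size + B.size                # parameters updated during fine-tuning
frozen = W.size                            # parameters left untouched
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and in this toy configuration the trainable parameters amount to a quarter of the frozen ones. At realistic model widths the ratio is far smaller, which is the usual reason for choosing LoRA on limited GPU memory.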

Abstract

Medical Visual Question Answering (VQA) enhances clinical decision-making by enabling systems to interpret medical images and answer clinical queries. However, developing efficient, high-performance VQA models is challenging due to the complexity of medical imagery and diverse modalities. In this paper, we introduce a lightweight, multimodal VQA model integrating BiomedCLIP for image feature extraction and LLaMA-3 for text processing. Designed for medical VQA tasks, our model achieves state-of-the-art performance on the OmniMedVQA dataset. With approximately 8 billion parameters, it requires only two NVIDIA 40 GB A100 GPUs, demonstrating superior efficiency over larger models. Our results show 73.4% accuracy for open-ended questions, surpassing existing models and validating its potential for real-world medical applications. Key contributions include a specialized multimodal VQA model, a resource-efficient architecture, and strong performance in answering open-ended clinical questions.

Paper Structure

This paper contains 14 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The architecture of the LLama-CLIP model. The model takes an image (left) and an open-ended question, such as "What type of abnormality is present in this image?" The BiomedCLIP module processes the image to generate image features, while LLaMA encodes the question to extract text features. LLaMA then integrates both sets of features and generates the final answer; here, it identifies "interstitial lung disease" as the abnormality shown in the image.
  • Figure 2: An example of question reformulation. The left side shows the original question-and-answer format in OmniMedVQA, while the right side displays the revised format used in our experiments. The gt_answer represents the ground truth answer.
  • Figure 3: Training and test loss over epochs on OmniMedVQA.
  • Figure 4: Model outputs for three distinct medical images. Light-colored modules indicate correct answers; dark-colored ones show errors.
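
The data flow described for Figure 1 can be sketched as follows. This is an illustrative NumPy example, not the paper's implementation: the captions do not say how image and text features are combined, so the linear projection followed by prepending visual tokens to the question embeddings is an assumption (a common pattern in vision-language models), and all dimensions and the function name `fuse` are hypothetical.

```python
import numpy as np

def fuse(image_feats, text_embeds, W_proj, b_proj):
    """Project image features into the language model's embedding space
    and prepend them to the question's token embeddings.

    Assumed fusion scheme for illustration only.
    """
    visual_tokens = image_feats @ W_proj + b_proj       # (n_patches, d_llm)
    return np.concatenate([visual_tokens, text_embeds], axis=0)

rng = np.random.default_rng(0)
n_patches, d_img = 16, 32    # stand-in for image encoder output (hypothetical sizes)
n_text, d_llm = 10, 48       # stand-in for question token embeddings (hypothetical sizes)

image_feats = rng.normal(size=(n_patches, d_img))
text_embeds = rng.normal(size=(n_text, d_llm))
W_proj = rng.normal(scale=0.02, size=(d_img, d_llm))    # trainable projection
b_proj = np.zeros(d_llm)

fused = fuse(image_feats, text_embeds, W_proj, b_proj)  # sequence fed to the language model
```

In this arrangement the language model sees one sequence of 26 vectors, the first 16 derived from the image and the remaining 10 from the question, and generates the answer conditioned on both.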