MiniGPT-Med: Large Language Model as a General Interface for Radiology Diagnosis

Asma Alkhaldi, Raneem Alnajim, Layan Alabdullatef, Rawan Alyahya, Jun Chen, Deyao Zhu, Ahmed Alsinan, Mohamed Elhoseiny

TL;DR

This work introduces MiniGPT-Med, a unified radiology vision-language system derived from a large language model to perform medical report generation, disease grounding, and medical visual question answering across X-ray, CT, and MRI data. It employs a frozen EVA vision backbone, a linear projection, and a LLaMA-2 chat LM, guided by task-specific prompts to enable diverse radiology tasks. The approach achieves state-of-the-art performance in medical report generation (e.g., surpassing CheXagent by $21.6$ in BERTsim and $5.2$ in CheXbert-Sim) and competitive results in grounding and VQA; in a radiologist evaluation, $76\%$ of the generated reports were judged to be high quality. This work demonstrates MiniGPT-Med as a general radiology interface with potential to improve diagnostic efficiency and lays groundwork for broader clinical validation.

Abstract

Recent advancements in artificial intelligence (AI) have precipitated significant breakthroughs in healthcare, particularly in refining diagnostic procedures. However, previous studies have often been constrained to limited functionalities. This study introduces MiniGPT-Med, a vision-language model derived from large-scale language models and tailored for medical applications. MiniGPT-Med demonstrates remarkable versatility across various imaging modalities, including X-rays, CT scans, and MRIs, enhancing its utility. The model is capable of performing tasks such as medical report generation, visual question answering (VQA), and disease identification within medical imagery. Its integrated processing of both image and textual clinical data markedly improves diagnostic accuracy. Our empirical assessments confirm MiniGPT-Med's superior performance in disease grounding, medical report generation, and VQA benchmarks, representing a significant step towards reducing the gap in assisting radiology practice. Furthermore, it achieves state-of-the-art performance on medical report generation, higher than the previous best model by 19% accuracy. MiniGPT-Med promises to become a general interface for radiology diagnoses, enhancing diagnostic efficiency across a wide range of medical imaging applications.

Paper Structure

This paper contains 20 sections, 1 equation, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: The diverse capabilities of MiniGPT-Med. It can perform disease detection, medical visual question answering, and medical report generation. MiniGPT-Med works with a wide range of radiological data (X-rays, CT scans, and MRIs) and is adept at diagnosing many diseases.
  • Figure 2: MiniGPT-Med Architecture Overview: The architecture comprises a vision encoder, a linear projection layer, and a large language model. It processes a single medical image, transforming it into visual semantic features via a pre-trained vision encoder. These features are concatenated into a single visual token. A linear projection layer then maps these visual tokens into the large language model's space. Throughout training, the vision encoder's parameters are kept frozen while the large language model and the linear projection layer are fine-tuned.
  • Figure 3: Examples of MiniGPT-Med's multi-task abilities: (a) medical report generation and (b) disease detection.
  • Figure 4: (Continued) Examples of MiniGPT-Med's multi-task abilities: (c) grounded medical image description, (d) referring disease grounding, (e) identifying diseases, and (f) visual question answering (VQA).
  • Figure 5: False positive example. The red bounding box marks the falsely detected disease, and the green bounding box marks the ground truth.
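
The data flow described in the Figure 2 caption (visual features, concatenation, linear projection into the language model's space) can be illustrated with a small sketch. This is not the authors' implementation: it uses plain Python lists instead of a deep learning framework, and every name and size here (`concat_adjacent`, `linear_project`, `group_size`, the toy dimensions) is a hypothetical choice made for illustration. In particular, it assumes one reading of the caption, namely that groups of adjacent feature vectors are concatenated into one longer token each; the group size of 4 is an assumption, not a value stated above.

```python
import random


def concat_adjacent(features, group_size):
    """Merge each run of `group_size` adjacent feature vectors into one longer vector."""
    if len(features) % group_size != 0:
        raise ValueError("number of features must be divisible by group_size")
    merged = []
    for start in range(0, len(features), group_size):
        token = []
        for vec in features[start:start + group_size]:
            token.extend(vec)
        merged.append(token)
    return merged


def linear_project(tokens, weight, bias):
    """Apply y = W x + b to every token; `weight` has shape (out_dim, in_dim)."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


# Toy sizes, chosen only to keep the example small (all hypothetical).
rng = random.Random(0)
num_patches, feat_dim, group_size, llm_dim = 8, 3, 4, 5

# Stand-in for the frozen vision encoder's output: one feature vector per patch.
patch_features = [
    [rng.uniform(-1.0, 1.0) for _ in range(feat_dim)] for _ in range(num_patches)
]

# Step 1: concatenate adjacent features, shortening the visual sequence.
visual_tokens = concat_adjacent(patch_features, group_size)

# Step 2: project each concatenated token into the language model's embedding size.
in_dim = feat_dim * group_size
weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(llm_dim)]
bias = [0.0] * llm_dim
llm_inputs = linear_project(visual_tokens, weight, bias)

print(len(visual_tokens), len(visual_tokens[0]))  # 2 12
print(len(llm_inputs), len(llm_inputs[0]))        # 2 5
```

In this sketch, concatenation cuts the number of visual tokens by a factor of `group_size` while widening each one, and the projection then brings every token to the width the language model expects. In a real training setup, only `weight`, `bias`, and the language model would receive gradient updates, since the caption states that the vision encoder stays fixed.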