GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis

Bo Liu, Ke Zou, Liming Zhan, Zexin Lu, Xiaoyu Dong, Yidi Chen, Chengqiang Xie, Jiannong Cao, Xiao-Ming Wu, Huazhu Fu

TL;DR

GEMeX addresses critical gaps in medical VQA by delivering a large-scale chest X-ray benchmark that jointly provides textual and visual explanations. The authors implement a two-stage pipeline (re-grounding radiology reports to 30 precise anatomical regions, then generating groundable VQA pairs via GPT-4o) to build a dataset of 151,025 images and 1,605,575 questions, the largest chest X-ray VQA resource to date. They evaluate 12 LVLMs and show that fine-tuning on GEMeX yields substantial gains in both answer accuracy and visual grounding, while underscoring the remaining challenges in truly multimodal medical reasoning. The work demonstrates the value of multimodal explanations for patient and clinician understanding and offers a strong benchmark and baseline for future medical LVLM development, with broad implications for explainable AI in healthcare.

Abstract

Medical Visual Question Answering (Med-VQA) combines computer vision and natural language processing to automatically answer clinical inquiries about medical images. However, current Med-VQA datasets exhibit two significant limitations: (1) they often lack visual and textual explanations for answers, hindering comprehension for patients and junior doctors; (2) they typically offer a narrow range of question formats, inadequately reflecting the diverse requirements in practical scenarios. These limitations pose significant challenges to the development of a reliable and user-friendly Med-VQA system. To address these challenges, we introduce a large-scale, Groundable, and Explainable Medical VQA benchmark for chest X-ray diagnosis (GEMeX), featuring several innovative components: (1) a multi-modal explainability mechanism that offers detailed visual and textual explanations for each question-answer pair, thereby enhancing answer comprehensibility; (2) four question types (open-ended, closed-ended, single-choice, and multiple-choice) to better reflect practical needs. With 151,025 images and 1,605,575 questions, GEMeX is currently the largest chest X-ray VQA dataset. Evaluation of 12 representative large vision language models (LVLMs) on GEMeX reveals suboptimal performance, underscoring the dataset's complexity. Meanwhile, we propose a strong model by fine-tuning an existing LVLM on the GEMeX training set. The substantial performance improvement showcases the dataset's effectiveness. The benchmark is available at https://www.med-vqa.com/GEMeX.
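
To make the abstract's description concrete, the following is a minimal sketch of how one question-answer pair with both explanation modalities might be represented. The class name `VQARecord`, all field names, and the bounding-box layout are assumptions chosen for illustration; they are not the released GEMeX schema. Only the four question types and the two dataset totals come from the abstract.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# The four question formats named in the abstract.
QUESTION_TYPES = ("open-ended", "closed-ended", "single-choice", "multiple-choice")

# Assumed box layout (not from the paper): (x_min, y_min, x_max, y_max).
BoundingBox = Tuple[float, float, float, float]


@dataclass
class VQARecord:
    """Hypothetical record; field names are illustrative, not the real schema."""

    image_id: str
    question_type: str
    question: str
    answer: str
    textual_explanation: str  # the textual reasoning behind the answer
    visual_grounding: List[BoundingBox] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject question formats other than the four listed in the abstract.
        if self.question_type not in QUESTION_TYPES:
            raise ValueError(f"unknown question type: {self.question_type!r}")
        # Reject degenerate boxes under the assumed corner layout.
        for x0, y0, x1, y1 in self.visual_grounding:
            if x0 >= x1 or y0 >= y1:
                raise ValueError("bounding box must satisfy x0 < x1 and y0 < y1")


# Dataset totals as reported in the abstract.
NUM_IMAGES = 151_025
NUM_QUESTIONS = 1_605_575

# Average number of questions per image implied by those totals.
questions_per_image = NUM_QUESTIONS / NUM_IMAGES

if __name__ == "__main__":
    # Made-up values, for illustration only.
    example = VQARecord(
        image_id="example-0001",
        question_type="closed-ended",
        question="Is there a pleural effusion?",
        answer="No",
        textual_explanation="The costophrenic angles appear sharp.",
        visual_grounding=[(10.0, 120.0, 90.0, 200.0)],
    )
    print(example.question_type, round(questions_per_image, 2))
```

The reported totals imply roughly 10.6 questions per image on average, which is consistent with the dataset pairing each image with several questions across the four formats.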

Paper Structure

This paper contains 30 sections, 7 figures, 18 tables.

Figures (7)

  • Figure 1: Our GEMeX stands out from existing medical VQA datasets by providing diverse question types and comprehensive multimodal explanations: textual reasoning and visual grounding.
  • Figure 2: Illustration of the proposed pipeline for constructing GEMeX, with two main stages. In Stage I (left), a medical LLM performs re-grounding on the original reports based on the pathological regions and clinical guidance specified by the radiologists, generating more precise sentence-region correspondence. In Stage II (right), a well-crafted prompt enables GPT-4o to generate a high-quality, large-scale Med-VQA dataset with both textual and visual explanations, leveraging the re-grounded reports from Stage I.
  • Figure 3: The distribution of normality and abnormality contained in images from the test set of our GEMeX.
  • Figure 4: The distribution of question content in our GEMeX.
  • Figure 5: Distribution of anatomical regions corresponding to each sentence after transformation from the Chest ImaGenome dataset.
  • ...and 2 more figures