GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis
Bo Liu, Ke Zou, Liming Zhan, Zexin Lu, Xiaoyu Dong, Yidi Chen, Chengqiang Xie, Jiannong Cao, Xiao-Ming Wu, Huazhu Fu
TL;DR
GEMeX is a large-scale chest X-ray VQA benchmark that pairs each question-answer pair with both textual and visual explanations, addressing two gaps in existing medical VQA datasets. The authors build it with a two-stage pipeline: first re-grounding radiology reports to 30 precise anatomical regions, then generating groundable VQA with GPT-4o. The result is 151,025 images and 1,605,575 questions, the largest chest X-ray VQA resource to date. They evaluate 12 large vision language models (LVLMs) and show that fine-tuning on GEMeX yields substantial gains in both answer accuracy and visual grounding, while truly multimodal medical reasoning remains challenging. The benchmark and the fine-tuned baseline offer a reference point for future medical LVLM development, and the multimodal explanations are meant to make answers easier for patients and clinicians to understand.
Abstract
Medical Visual Question Answering (Med-VQA) combines computer vision and natural language processing to automatically answer clinical inquiries about medical images. However, current Med-VQA datasets exhibit two significant limitations: (1) they often lack visual and textual explanations for answers, hindering comprehension for patients and junior doctors; (2) they typically offer a narrow range of question formats, inadequately reflecting the diverse requirements in practical scenarios. These limitations pose significant challenges to the development of a reliable and user-friendly Med-VQA system. To address these challenges, we introduce a large-scale, Groundable, and Explainable Medical VQA benchmark for chest X-ray diagnosis (GEMeX), featuring several innovative components: (1) a multi-modal explainability mechanism that offers detailed visual and textual explanations for each question-answer pair, thereby enhancing answer comprehensibility; (2) four question types (open-ended, closed-ended, single-choice, and multiple-choice) to better reflect practical needs. With 151,025 images and 1,605,575 questions, GEMeX is currently the largest chest X-ray VQA dataset. Evaluation of 12 representative large vision language models (LVLMs) on GEMeX reveals suboptimal performance, underscoring the dataset's complexity. Meanwhile, we propose a strong model by fine-tuning an existing LVLM on the GEMeX training set. The substantial performance improvement showcases the dataset's effectiveness. The benchmark is available at https://www.med-vqa.com/GEMeX.
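To make the abstract's description concrete, the following is a minimal sketch of how a single groundable, explainable question-answer record might be represented. It is an illustration only: the class name `GroundedVQAItem`, all field names, the bounding-box convention, and the example record are assumptions made for this sketch and do not reflect the released GEMeX file format. The only values taken from the abstract are the four question types and the image and question counts.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class QuestionType(Enum):
    """The four question formats named in the abstract."""
    OPEN_ENDED = "open-ended"
    CLOSED_ENDED = "closed-ended"
    SINGLE_CHOICE = "single-choice"
    MULTIPLE_CHOICE = "multiple-choice"


# Hypothetical box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class GroundedVQAItem:
    """Hypothetical record layout; NOT the actual GEMeX schema."""
    image_id: str
    question: str
    question_type: QuestionType
    answer: str
    textual_explanation: str                                # textual reason for the answer
    visual_locations: List[Box] = field(default_factory=list)  # visual explanation

    def is_explained(self) -> bool:
        """True when both a textual and a visual explanation are present."""
        return bool(self.textual_explanation.strip()) and len(self.visual_locations) > 0


def questions_per_image(num_questions: int, num_images: int) -> float:
    """Average number of questions per image."""
    return num_questions / num_images


# Made-up example record, for illustration only (not taken from the dataset).
example = GroundedVQAItem(
    image_id="example-0001",
    question="Is there a pleural effusion?",
    question_type=QuestionType.CLOSED_ENDED,
    answer="Yes",
    textual_explanation="Blunting of the costophrenic angle suggests fluid.",
    visual_locations=[(120, 340, 260, 470)],
)

# Counts reported in the abstract: 1,605,575 questions over 151,025 images.
average = questions_per_image(1_605_575, 151_025)
print(f"{average:.2f} questions per image on average")
```

With the counts reported in the abstract, the average works out to roughly 10.6 questions per image.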
