GEMeX-RMCoT: An Enhanced Med-VQA Dataset for Region-Aware Multimodal Chain-of-Thought Reasoning

Bo Liu; Xiangyu Zhao; Along He; Yidi Chen; Huazhu Fu; Xiao-Ming Wu

TL;DR

This work targets interpretability in Medical Visual Question Answering by introducing GEMeX-RMCoT, a region-grounded multimodal chain-of-thought dataset that ties reasoning steps to anatomical regions. It combines supervised fine-tuning with a novel verifiable reinforcement-learning reward to align intermediate thinking with both visual grounding and final answers, achieving competitive results using only 1/8 of the full data. The approach yields more trustworthy and transparent Med-VQA outputs, with improved robustness to input variations and clearer evidence grounding. Overall, GEMeX-RMCoT advances explainable AI in medical imaging by enabling step-by-step, region-informed reasoning with verifiable outcomes that clinicians can inspect and trust.

Abstract

Medical visual question answering (Med-VQA) aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. While recent advances in multimodal learning have significantly improved performance, current methods still suffer from limited answer reliability and poor interpretability, which makes it hard for clinicians and patients to understand and trust model outputs. To address these limitations, we first propose a Region-Aware Multimodal Chain-of-Thought (RMCoT) dataset, in which each answer is preceded by a sequence of intermediate reasoning steps that explicitly ground the relevant visual regions of the medical image, thereby providing fine-grained explainability. Furthermore, we introduce a novel verifiable reward mechanism for reinforcement learning to guide post-training, improving the alignment between the model's reasoning process and its final answer. Remarkably, our method achieves comparable performance using only one-eighth of the training data, demonstrating the efficiency and effectiveness of the proposal. The dataset is available at https://www.med-vqa.com/GEMeX/.
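To make the idea of a verifiable reward concrete, the following is a minimal illustrative sketch, not the authors' formulation. It assumes a reward that mixes an exact-match answer score with a grounding score based on box overlap (IoU). The function names, the `(x1, y1, x2, y2)` box format, and the `grounding_weight` parameter are all hypothetical and do not come from the paper.

```python
from typing import Sequence, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def verifiable_reward(
    pred_answer: str,
    gold_answer: str,
    pred_boxes: Sequence[Box],
    gold_boxes: Sequence[Box],
    grounding_weight: float = 0.5,  # assumed mixing weight, not from the paper
) -> float:
    """Assumed reward: weighted sum of answer correctness and grounding quality."""
    # Answer term: 1 if the normalized strings match exactly, else 0.
    answer_score = float(pred_answer.strip().lower() == gold_answer.strip().lower())

    # Grounding term: for each reference region, take the best-overlapping
    # predicted region, then average over the reference regions.
    if gold_boxes:
        grounding_score = sum(
            max((iou(g, p) for p in pred_boxes), default=0.0) for g in gold_boxes
        ) / len(gold_boxes)
    else:
        grounding_score = 0.0

    return (1.0 - grounding_weight) * answer_score + grounding_weight * grounding_score
```

Both terms in this sketch can be checked automatically against annotations, which is the sense in which such a reward is "verifiable". The actual reward used in the paper may differ in its components and weighting.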
