GEMeX-RMCoT: An Enhanced Med-VQA Dataset for Region-Aware Multimodal Chain-of-Thought Reasoning

Bo Liu, Xiangyu Zhao, Along He, Yidi Chen, Huazhu Fu, Xiao-Ming Wu

TL;DR

This work targets interpretability in Medical Visual Question Answering by introducing GEMeX-RMCoT, a region-grounded multimodal chain-of-thought dataset that ties reasoning steps to anatomical regions. It combines supervised fine-tuning with a novel verifiable reinforcement-learning reward to align intermediate thinking with both visual grounding and final answers, achieving competitive results using only 1/8 of the full data. The approach yields more trustworthy and transparent Med-VQA outputs, with improved robustness to input variations and clearer evidence grounding. Overall, GEMeX-RMCoT advances explainable AI in medical imaging by enabling step-by-step, region-informed reasoning with verifiable outcomes that clinicians can inspect and trust.

Abstract

Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. While recent advances in multi-modal learning have significantly improved performance, current methods still suffer from limited answer reliability and poor interpretability, impairing the ability of clinicians and patients to understand and trust model outputs. To address these limitations, this work first proposes a Region-Aware Multimodal Chain-of-Thought (RMCoT) dataset, in which the process of producing an answer is preceded by a sequence of intermediate reasoning steps that explicitly ground relevant visual regions of the medical image, thereby providing fine-grained explainability. Furthermore, we introduce a novel verifiable reward mechanism for reinforcement learning to guide post-training, improving the alignment between the model's reasoning process and its final answer. Remarkably, our method achieves comparable performance using only one-eighth of the training data, demonstrating the efficiency and effectiveness of the proposal. The dataset is available at https://www.med-vqa.com/GEMeX/.

Paper Structure

This paper contains 22 sections, 2 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: (Top) The traditional answer with explainability provided by GEMeX, which uses separate textual and visual prompts; (★ Bottom) our multimodal, more detailed thinking process for answer generation, which explicitly grounds evidence in specific regions of the medical image (shown in colored text), i.e., anatomical areas that support the diagnosis, thereby enhancing the understanding of questions and answers.
  • Figure 2: Illustration of the proposed uncertainty-driven multi-agent framework for generating the RMCoT dataset. The left (orange) section depicts the data generation process, while the right (blue) section shows the uncertainty-driven quality assurance pipeline, i.e., the "not sure" option prompts a deeper reflection and more deliberate decision-making.
  • Figure 3: An example from the SFT and RFT stages: after SFT, the model learns to think carefully and incorporates grounding to specific pathological regions for visual evidence. During the RFT stage, since both the answer and the grounded regions are correct, the two reward scores are each 1.0.
  • Figure 4: One challenging example from GEMeX answered by models trained with and without RMCoT. (✓) and (✘) in the outputs mark correct and incorrect reasons or answers, respectively. The colored words indicate reasoning steps that are grounded in image regions.
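
The Figure 3 caption mentions two reward scores, one for the answer and one for the grounded regions, each reaching 1.0 when the output is correct. The page does not give the reward formulas, so the following is only a minimal sketch of how such a pair of verifiable rewards could be computed. The box format `(x1, y1, x2, y2)`, the exact-match answer check, the IoU threshold of 0.5, and all function names are assumptions made for illustration, not the authors' definitions.

```python
from typing import List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def answer_reward(pred: str, gold: str) -> float:
    """Assumed rule: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0


def grounding_reward(pred_boxes: List[Box], gold_boxes: List[Box],
                     iou_threshold: float = 0.5) -> float:
    """Assumed rule: fraction of reference regions matched by some
    predicted box with IoU at or above the threshold."""
    if not gold_boxes:
        # No reference regions: reward only an empty prediction.
        return 1.0 if not pred_boxes else 0.0
    matched = sum(
        1 for g in gold_boxes
        if any(iou(p, g) >= iou_threshold for p in pred_boxes)
    )
    return matched / len(gold_boxes)


if __name__ == "__main__":
    r_answer = answer_reward("Yes", " yes ")
    r_ground = grounding_reward([(10, 10, 50, 50)], [(12, 12, 50, 50)])
    print(r_answer, r_ground)
```

Both rewards here can be checked mechanically against reference annotations, which is the sense in which such a reward is verifiable; how the two scores are weighted or combined during reinforcement learning is left open in this sketch.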