MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning

Suhao Yu, Haojin Wang, Juncheng Wu, Cihang Xie, Yuyin Zhou

TL;DR

MedFrameQA addresses the lack of clinically grounded multi-image reasoning in medical VQA by introducing a benchmark in which each question requires integrating information across 2–5 frames from medical videos. A scalable pipeline collects YouTube medical videos, extracts frames and aligns them with refined captions, merges related frames into coherent clips, and uses GPT-4o to generate cross-frame VQA items with ground-truth rationales. Ten state-of-the-art MLLMs, both reasoning-enabled and non-reasoning, are evaluated; most accuracies stay below 50%, and reasoning models offer gains but still struggle with information neglect and error propagation across frames. The dataset and methodology provide a resource for advancing clinically grounded, multi-image diagnostic AI and set baselines for cross-frame medical reasoning research.

Abstract

Existing medical VQA benchmarks mostly focus on single-image analysis, yet clinicians almost always compare a series of images before reaching a diagnosis. To better approximate this workflow, we introduce MedFrameQA -- the first benchmark that explicitly evaluates multi-image reasoning in medical VQA. To build MedFrameQA at scale and with high quality, we develop 1) an automated pipeline that extracts temporally coherent frames from medical videos and constructs VQA items whose content evolves logically across images, and 2) a multi-stage filtering strategy, including model-based and manual review, to preserve data clarity, difficulty, and medical relevance. The resulting dataset comprises 2,851 VQA pairs (gathered from 9,237 high-quality frames in 3,420 videos), covering nine human body systems and 43 organs; every question is accompanied by two to five images. We comprehensively benchmark ten advanced Multimodal LLMs -- both proprietary and open source, with and without explicit reasoning modules -- on MedFrameQA. The evaluation reveals that all models perform poorly, with most accuracies below 50%, and that accuracy fluctuates as the number of images per question increases. Error analysis further shows that models frequently ignore salient findings, mis-aggregate evidence across images, and propagate early mistakes through their reasoning chains; results also vary substantially across body systems, organs, and modalities. We hope this work can catalyze research on clinically grounded, multi-image reasoning and accelerate progress toward more capable diagnostic AI systems.
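
The abstract reports accuracy as a function of the number of images per question. As a minimal sketch of how such a breakdown could be computed, the snippet below groups per-question results by image count. The record fields (`num_images`, `prediction`, `answer`) and the toy data are hypothetical and do not reflect the benchmark's actual file format or results.

```python
from collections import defaultdict


def accuracy_by_image_count(records):
    """Return {num_images: accuracy} for an iterable of result records.

    Each record is a dict with hypothetical keys:
      'num_images' - number of frames attached to the question (2 to 5)
      'prediction' - the option letter chosen by the model
      'answer'     - the ground-truth option letter
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        n = record["num_images"]
        total[n] += 1
        if record["prediction"] == record["answer"]:
            correct[n] += 1
    # Sort keys so the breakdown reads from fewest to most images.
    return {n: correct[n] / total[n] for n in sorted(total)}


# Toy records for illustration only; not real MedFrameQA results.
toy_records = [
    {"num_images": 2, "prediction": "A", "answer": "A"},
    {"num_images": 2, "prediction": "B", "answer": "C"},
    {"num_images": 3, "prediction": "D", "answer": "D"},
]
toy_breakdown = accuracy_by_image_count(toy_records)
```

Grouping by image count rather than reporting a single overall score makes fluctuations of the kind the abstract describes visible; the same pattern would apply to grouping by body system, organ, or modality.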

Paper Structure

This paper contains 34 sections, 4 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Comparison of medical VQA benchmarks. MedFrameQA introduces multi-image, clinically grounded questions that require comprehensive reasoning across all images. Unlike prior benchmarks such as SLAKE [DBLP:conf/isbi/LiuZXMYW21] and MedXpertQA [DBLP:journals/corr/abs-2501-18362], it emphasizes diagnostic complexity, expert-level knowledge, and explicit reasoning chains.
  • Figure 2: Our data generation pipeline. (a) Medical Video Collection: Collecting 3,420 medical videos via clinical search queries (Sec. video_collection). (b) Frame-Caption Pairing: Extracting keyframes and aligning them with transcribed captions (Sec. pair_process). (c) Multi-Frame Merging: Merging clinically related frame-caption pairs into multi-frame clips (Sec. frame_merge). (d) Question-Answer Generation: Generating multi-image VQA from the multi-frame clips (Sec. QA_generation).
  • Figure 3: Failure case study of o1 on MedFrameQA. Negligence of important information across multiple frames. In this case, o1 overlooked critical features in the second and third frames, which ultimately led to the selection of an incorrect answer.
  • Figure 4: Failure case study of o1 on MedFrameQA. A mistake originating from a single image can result in significant errors in subsequent reasoning. In this case, o1 made a directional error when interpreting the first frame, which propagated through its reasoning process and ultimately led to an incorrect answer.
  • Figure 5: Data distribution of MedFrameQA. (a) shows the distribution across body systems; (b) presents the distribution across organs; (c) shows the distribution across imaging modalities; (d) provides a word cloud of keywords in MedFrameQA; and (e) reports the distribution of frame counts per question.
  • ...and 10 more figures