Med-CMR: A Fine-Grained Benchmark Integrating Visual Evidence and Clinical Logic for Medical Complex Multimodal Reasoning

Haozhen Gong, Xiaozhong Ji, Yuansen Liu, Wenbin Wu, Xiaoxiao Yan, Jingjing Liu, Kai Wu, Jiazhen Pan, Bailiang Jian, Jiangning Zhang, Xiaobin Hu, Hongwei Bran Li

TL;DR

Med-CMR is a fine-grained benchmark for medical multimodal reasoning that separates visual perception from higher-order inference across seven task dimensions, covering 11 body systems and 12 imaging modalities. It pairs a curated dataset with multiple-choice (MCQ) and open-ended evaluation, using an external LLM-based scorer and human alignment checks to support clinical validity and reliability. Across 18 state-of-the-art MLLMs, GPT-5 performs best (57.81 MCQ accuracy, 48.70 open-ended score), but the results reveal persistent gaps in long-tail generalization and in integrating subtle visual cues with complex reasoning, even for medically fine-tuned models. The benchmark serves as a stress test and a practical yardstick for future clinical AI systems, and it shows where methodological advances are most needed to reach expert-level medical reasoning.

Abstract

Multimodal large language models (MLLMs) are beginning to appear in clinical workflows, but their ability to perform complex medical reasoning remains unclear. We present Med-CMR, a fine-grained Medical Complex Multimodal Reasoning benchmark. Med-CMR differs from existing benchmarks in three core features: 1) Systematic capability decomposition, splitting medical multimodal reasoning into fine-grained visual understanding and multi-step reasoning to enable targeted evaluation; 2) Challenging task design, with visual understanding across three key dimensions (small-object detection, fine-detail discrimination, spatial understanding) and reasoning covering four clinically relevant scenarios (temporal prediction, causal reasoning, long-tail generalization, multi-source integration); 3) Broad, high-quality data coverage, comprising 20,653 Visual Question Answering (VQA) pairs spanning 11 organ systems and 12 imaging modalities, validated via a rigorous two-stage (human expert + model-assisted) review to ensure clinical authenticity. We evaluate 18 state-of-the-art MLLMs with Med-CMR. GPT-5 is the top-performing commercial model, with 57.81 accuracy on multiple-choice questions (MCQs) and a 48.70 open-ended score, outperforming Gemini 2.5 Pro (49.87 MCQ accuracy, 45.98 open-ended score) and the leading open-source model, Qwen3-VL-235B-A22B (49.34 MCQ accuracy, 42.62 open-ended score). However, specialized medical MLLMs do not reliably outperform strong general models, and long-tail generalization emerges as the dominant failure mode. Med-CMR thus provides a stress test for visual-reasoning integration and rare-case robustness in medical MLLMs, and a rigorous yardstick for future clinical systems.
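The capability decomposition described above lends itself to per-dimension scoring. The following is a minimal sketch of how MCQ accuracy could be broken down by the seven task dimensions and then aggregated into the visual and reasoning groups. The record keys (`dimension`, `prediction`, `answer`), the function names, and the toy records are assumptions made for illustration; they are not the benchmark's actual data schema or evaluation code.

```python
from collections import defaultdict

# The seven task dimensions named in the abstract, grouped by capability.
VISUAL = ("small-object detection", "fine-detail discrimination",
          "spatial understanding")
REASONING = ("temporal prediction", "causal reasoning",
             "long-tail generalization", "multi-source integration")


def _is_correct(record):
    """Case- and whitespace-insensitive match of the chosen option letter."""
    return record["prediction"].strip().upper() == record["answer"].strip().upper()


def per_dimension_accuracy(records):
    """Return MCQ accuracy (in percent) for each task dimension.

    `records` is an iterable of dicts with the hypothetical keys
    'dimension', 'prediction', and 'answer'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["dimension"]] += 1
        if _is_correct(record):
            correct[record["dimension"]] += 1
    return {dim: 100.0 * correct[dim] / total[dim] for dim in total}


def group_accuracy(records, dimensions):
    """Micro-averaged accuracy (percent) over a group of dimensions."""
    subset = [r for r in records if r["dimension"] in dimensions]
    if not subset:
        return None  # no questions in this group
    return 100.0 * sum(_is_correct(r) for r in subset) / len(subset)


# Toy records, invented purely for illustration (not benchmark data).
toy = [
    {"dimension": "spatial understanding", "prediction": "A", "answer": "A"},
    {"dimension": "spatial understanding", "prediction": "b", "answer": "C"},
    {"dimension": "causal reasoning", "prediction": "d", "answer": "D"},
    {"dimension": "long-tail generalization", "prediction": "B", "answer": "A"},
]
scores = per_dimension_accuracy(toy)
visual_score = group_accuracy(toy, VISUAL)
reasoning_score = group_accuracy(toy, REASONING)
```

Reporting each dimension separately, rather than a single pooled number, is what would make a weakness such as long-tail generalization visible in the first place.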

Paper Structure

This paper contains 29 sections, 1 equation, 16 figures, 11 tables.

Figures (16)

  • Figure 1: Overview of Med-CMR. Med-CMR decomposes medical multimodal reasoning complexity into visual complexity (i.e., small-object detection, fine-detail discrimination, and spatial understanding) and reasoning complexity (i.e., temporal prediction, causal reasoning, long-tail generalization, and multi-source integration). Each dimension corresponds to a specific task designed to evaluate the model’s capability in that dimension.
  • Figure 2: Benchmark statistics. The left panel displays the inference types across the seven question types in the benchmark and their corresponding quantitative relationships with medical ability. The right panel shows the imaging modalities of the benchmark images and the body systems involved.
  • Figure 3: Correlation between model size and performance in different metrics.
  • Figure 4: (a) Human-labeled GPT-5 error distribution across question dimensions. (b) Comparison of base models and corresponding medical models on the Med-CMR metrics. (c) MCQ and open-ended results from 500 reformulated MCQs for comparing base models and corresponding medical models. (d) Comparison of win ratios under human and LLM (DeepSeek-V3.2-Exp) evaluation across four dimensions.
  • ...and 12 more figures