Beyond Medical Diagnostics: How Medical Multimodal Large Language Models Think in Space

Quoc-Huy Trinh; Xi Ding; Yang Liu; Zhenyue Qin; Xingjian Li; Gorkem Durak; Halil Ertugrul Aktas; Elif Keles; Ulas Bagci; Min Xu

Abstract

Visual spatial intelligence is critical for medical image interpretation, yet remains largely unexplored in Multimodal Large Language Models (MLLMs) for 3D imaging. This gap persists due to a systemic lack of datasets featuring structured 3D spatial annotations beyond basic labels. In this study, we introduce an agentic pipeline that autonomously synthesizes spatial visual question-answering (VQA) data by orchestrating computational tools such as volume and distance calculators with multi-agent collaboration and expert radiologist validation. We present SpatialMed, the first comprehensive benchmark for evaluating 3D spatial intelligence in medical MLLMs, comprising nearly 10K question-answer pairs across multiple organs and tumor types. Our evaluations on 14 state-of-the-art MLLMs and extensive analyses reveal that current models lack robust spatial reasoning capabilities for medical imaging.
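As a rough illustration of the kind of computational tools the abstract mentions (volume and distance calculators), the sketch below computes a structure's volume and the centroid-to-centroid distance between two structures from binary 3D segmentation masks. The function names, the use of centroids for distance, and the voxel-spacing convention are assumptions made for this example; the abstract does not specify how the paper's tools are implemented.

```python
import numpy as np

def mask_volume_ml(mask, spacing_mm):
    """Volume of a binary 3D mask in millilitres (hypothetical helper).

    spacing_mm is the physical voxel size along each array axis, in mm.
    """
    voxel_mm3 = float(np.prod(spacing_mm))          # volume of one voxel
    n_voxels = int(np.count_nonzero(mask))          # voxels inside the structure
    return n_voxels * voxel_mm3 / 1000.0            # 1 mL = 1000 mm^3

def centroid_distance_mm(mask_a, mask_b, spacing_mm):
    """Euclidean distance in mm between the centroids of two binary masks.

    Centroid distance is an assumption here; a surface-to-surface distance
    would be an equally plausible choice.
    """
    spacing = np.asarray(spacing_mm, dtype=float)
    centroid_a = np.argwhere(mask_a).mean(axis=0) * spacing
    centroid_b = np.argwhere(mask_b).mean(axis=0) * spacing
    return float(np.linalg.norm(centroid_a - centroid_b))

# Toy example: two 2x2x2 blocks separated along the last axis.
spacing = (1.0, 1.0, 5.0)                           # anisotropic CT spacing
organ = np.zeros((10, 10, 10), dtype=np.uint8)
tumor = np.zeros((10, 10, 10), dtype=np.uint8)
organ[0:2, 0:2, 0:2] = 1
tumor[0:2, 0:2, 4:6] = 1

organ_volume = mask_volume_ml(organ, spacing)
organ_tumor_distance = centroid_distance_mm(organ, tumor, spacing)
```

Values computed this way could serve as ground-truth answers for volume or distance questions, which is the role the abstract assigns to such tools.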

Paper Structure (14 sections, 5 equations, 5 figures, 2 tables)

Introduction
Related Works
Methods: SpatialMed
Data Source
Agentic Benchmark Construction Pipeline
Dataset Distribution
Experiments and Results
Experimental Setup
Evaluation on SpatialMed
Failure analysis
Conclusion
Prompt for Dataset Creation
Retrieved Augmented Generation Details
Impact Statement

Figures (5)

Figure 1: Task demonstrations in SpatialMed, covering six spatial reasoning tasks, with corresponding 3D CT visualizations.
Figure 2: Overview of the three stages of the SpatialMed dataset pipeline: (1) Question-Answer Pair Generation, where agents produce clinically grounded QA pairs using medical knowledge and spatial analysis tools; (2) Data Quality Validation, in which multiple specialist agents verify medical correctness and the necessity of visual-spatial evidence; and (3) Radiologist Validation, where multiple radiologists review and validate the quality of the dataset.
Figure 3: Benchmark statistics. Left: Distribution of annotated anatomical regions in the MCA task. Right: Dataset distribution across the volume task, where the y-axis is shown on a log$_2$ scale.
Figure 4: Fine-grained performance analysis across anatomical structures, tumor types, and volume scales. (a) Per-organ accuracy across models in the MCA task. (b) Tumor-wise accuracy across selected models. (c) Performance stratified by anatomical volume buckets using Mean Relative Accuracy.
Figure 5: Failure and faithfulness analysis of MLLM spatial reasoning. (a) Human-annotated taxonomy of reasoning errors, where numeric and relational errors dominate. (b) Faithfulness matrix categorizing predictions into faithful reasoning, decision errors, lucky guesses, and hallucination.
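
Figure 4(c) reports Mean Relative Accuracy for numeric answers. This section does not define the metric, so the sketch below assumes one commonly used formulation: a prediction counts as correct at confidence threshold θ when its relative error is below 1 − θ, and the score is the fraction of thresholds θ ∈ {0.50, 0.55, ..., 0.95} at which it is correct. The function name and the threshold set are assumptions for illustration and may differ from the paper's definition.

```python
def mean_relative_accuracy(pred, target):
    """Mean Relative Accuracy for one numeric prediction (assumed definition).

    Averages the indicator [relative_error < 1 - theta] over ten thresholds
    theta = 0.50, 0.55, ..., 0.95. target must be non-zero.
    """
    thresholds = [i / 100 for i in range(50, 100, 5)]
    relative_error = abs(pred - target) / abs(target)
    hits = sum(1 for theta in thresholds if relative_error < 1 - theta)
    return hits / len(thresholds)

# A perfect prediction passes every threshold; a 12% error passes the
# eight thresholds whose tolerance (1 - theta) is 0.15 or larger.
score_exact = mean_relative_accuracy(100.0, 100.0)
score_close = mean_relative_accuracy(88.0, 100.0)
score_far = mean_relative_accuracy(0.0, 100.0)
```

Under this formulation the score degrades gradually with relative error instead of requiring an exact numeric match, which suits estimates such as organ or tumor volume.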
