Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering
Yixiong Chen, Wenjie Xiao, Pedro R. A. S. Bassi, Xinze Zhou, Sezgin Er, Ibrahim Ethem Hamamci, Zongwei Zhou, Alan Yuille
TL;DR
This work introduces DeepTumorVQA, the first large-scale 3D VQA benchmark for abdominal tumor diagnosis in CT, comprising 9,262 volumes and 395K expert-level question-answer pairs across four categories: Recognition, Measurement, Visual Reasoning, and Medical Reasoning. An evaluation of four state-of-the-art VLMs (RadFM, M3D, Merlin, CT-CHAT) shows adequate performance on measurement but notable weaknesses in lesion recognition and higher-level clinical reasoning, underscoring gaps in 3D perception and domain-specific understanding. Key findings include the critical role of large-scale multimodal pretraining and full model tuning (as in RadFM), the advantage of ViT-based 3D encoders over single-token CNN backbones, and the effectiveness of segmentation-based preprocessing for injecting anatomical priors. The benchmark points to concrete directions for progress in medical multimodal learning, and its data, code, and evaluation tools are open source to support further development and safer clinical deployment.
Abstract
Vision-Language Models (VLMs) have shown promise in various 2D visual tasks, yet their readiness for 3D clinical diagnosis remains unclear because such diagnosis places stringent demands on recognition precision, reasoning ability, and domain knowledge. To systematically evaluate these dimensions, we present DeepTumorVQA, a diagnostic visual question answering (VQA) benchmark targeting abdominal tumors in CT scans. It comprises 9,262 CT volumes (3.7M slices) from 17 public datasets, with 395K expert-level questions spanning four categories: Recognition, Measurement, Visual Reasoning, and Medical Reasoning. DeepTumorVQA introduces unique challenges, including small tumor detection and clinical reasoning across 3D anatomy. Benchmarking four advanced VLMs (RadFM, M3D, Merlin, CT-CHAT), we find that current models perform adequately on measurement tasks but struggle with lesion recognition and reasoning, and do not yet meet clinical needs. Two key insights emerge: (1) large-scale multimodal pretraining plays a crucial role in test performance on DeepTumorVQA, making RadFM stand out among all VLMs; and (2) our dataset exposes critical differences in VLM components, where proper image preprocessing and the design of vision modules significantly affect 3D perception. To facilitate medical multimodal research, we have released DeepTumorVQA as a rigorous benchmark: https://github.com/Schuture/DeepTumorVQA.
