Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering

Yixiong Chen, Wenjie Xiao, Pedro R. A. S. Bassi, Xinze Zhou, Sezgin Er, Ibrahim Ethem Hamamci, Zongwei Zhou, Alan Yuille

TL;DR

This work introduces DeepTumorVQA, the first large-scale 3D VQA benchmark for abdominal tumor diagnosis using CT volumes, with 9,262 volumes and 395K expert-level question-answer pairs spanning Recognition, Measurement, Visual Reasoning, and Medical Reasoning. Evaluating four state-of-the-art VLMs (RadFM, M3D, Merlin, CT-CHAT), the study finds adequate performance on measurement but notable weaknesses in lesion recognition and higher-level clinical reasoning, pointing to gaps in 3D perception and domain-specific understanding. Key findings include the critical role of large-scale multimodal pretraining and full model tuning (as in RadFM), the advantage of ViT-based 3D encoders over single-token CNN backbones, and the effectiveness of segmentation-based preprocessing for injecting anatomical priors. The benchmark identifies concrete directions for progress in medical multimodal learning and provides open-source data, code, and evaluation tools to support further development and safer clinical deployment.

Abstract

Vision-Language Models (VLMs) have shown promise in various 2D visual tasks, yet their readiness for 3D clinical diagnosis remains unclear due to stringent demands for recognition precision, reasoning ability, and domain knowledge. To systematically evaluate these dimensions, we present DeepTumorVQA, a diagnostic visual question answering (VQA) benchmark targeting abdominal tumors in CT scans. It comprises 9,262 CT volumes (3.7M slices) from 17 public datasets, with 395K expert-level questions spanning four categories: Recognition, Measurement, Visual Reasoning, and Medical Reasoning. DeepTumorVQA introduces unique challenges, including small tumor detection and clinical reasoning across 3D anatomy. Benchmarking four advanced VLMs (RadFM, M3D, Merlin, CT-CHAT), we find that current models perform adequately on measurement tasks but struggle with lesion recognition and reasoning, and still fall short of clinical needs. Two key insights emerge: (1) large-scale multimodal pretraining plays a crucial role in DeepTumorVQA test performance, making RadFM stand out among all VLMs; (2) our dataset exposes critical differences in VLM components, where proper image preprocessing and the design of vision modules significantly affect 3D perception. To facilitate medical multimodal research, we have released DeepTumorVQA as a rigorous benchmark: https://github.com/Schuture/DeepTumorVQA.

Paper Structure

This paper contains 28 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overview of tasks in the DeepTumorVQA benchmark. The dataset covers four core clinical question types, totaling 29 subtypes. Tasks include numerical quantification (e.g., organ volume, Hounsfield Unit (HU) value), lesion recognition, spatial reasoning (e.g., comparisons, adjacency), and high-level clinical diagnosis (e.g., tumor staging, resectability). Each question is paired with image evidence and formatted for either multiple-choice or free-text answer prediction, enabling evaluation of both perceptual and diagnostic reasoning in VLMs.
  • Figure 2: Statistics of DeepTumorVQA. Left: the distribution of QA pairs for tasks across four main types. Right: distribution of CT volumes w.r.t. CT physical depth (z-axis) and patient types.
  • Figure 3: Overview of question construction in the DeepTumorVQA dataset. (A) Structured metadata is extracted from organ and lesion segmentation masks (e.g., location, volume, HU value, enlargement) and parsed radiology reports (e.g., lesion type, adjacent organs, vascular involvement). (B) These metadata are used to define modular logic programs for different diagnostic question types. For example, liver segment-level lesion counts are used to construct distribution-based visual reasoning questions. Each program maps to one of four task types and 29 subtypes, and is rendered into natural language using predefined question templates.
  • Figure 4: RadFM accuracy on reasoning tasks with and without measurement/recognition tasks.
  • Figure 5: Lesion recognition sensitivity of RadFM under different lesion sizes (left) and HU contrast ranges (right). Left: Sensitivity increases with size only for kidney tumors, while liver and pancreatic lesions show no consistent trend. Right: Higher HU contrast leads to higher sensitivity across all lesion types, indicating that intensity-based features significantly affect detection performance.
  • ...and 2 more figures
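
The question-construction pipeline described in the Figure 3 caption (structured metadata, then a logic program, then a natural-language template) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `LesionRecord` fields, the function names, the question wording, and the tie-handling rule are all hypothetical and chosen only to show the general idea of turning liver segment-level lesion counts into a distribution-based visual reasoning question.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LesionRecord:
    """Hypothetical per-lesion metadata, as might be derived from segmentation masks."""
    organ: str
    segment: int        # assumed liver segment index; ignored for other organs
    volume_mm3: float


def count_lesions_per_segment(lesions: list[LesionRecord]) -> dict[int, int]:
    """Logic-program step: count liver lesions in each segment."""
    counts: dict[int, int] = {}
    for lesion in lesions:
        if lesion.organ == "liver":
            counts[lesion.segment] = counts.get(lesion.segment, 0) + 1
    return counts


def build_segment_question(lesions: list[LesionRecord]) -> Optional[dict[str, str]]:
    """Template step: render the counts into a question-answer pair.

    Returns None when there are no liver lesions or when several segments
    tie for the maximum, so that every emitted question has a single answer.
    (Skipping ties is an assumption of this sketch.)
    """
    counts = count_lesions_per_segment(lesions)
    if not counts:
        return None
    top = max(counts.values())
    winners = sorted(seg for seg, c in counts.items() if c == top)
    if len(winners) != 1:
        return None
    return {
        "task_type": "Visual Reasoning",
        "question": "Which liver segment contains the most lesions?",
        "answer": f"segment {winners[0]}",
    }


if __name__ == "__main__":
    example = [
        LesionRecord("liver", 2, 100.0),
        LesionRecord("liver", 2, 50.0),
        LesionRecord("liver", 7, 30.0),
        LesionRecord("kidney", 0, 10.0),
    ]
    print(build_segment_question(example))
```

Keeping the counting logic separate from the template rendering mirrors the modular structure the caption describes: the same metadata could feed other hypothetical programs (for example, a volume comparison) without changing the rendering step.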