3D-RAD: A Comprehensive 3D Radiology Med-VQA Dataset with Multi-Temporal Analysis and Diverse Diagnostic Tasks

Xiaotang Gai, Jiaxiang Liu, Yichen Li, Zijie Meng, Jian Wu, Zuozhu Liu

TL;DR

3D-RAD introduces a large-scale 3D radiology VQA benchmark built on CT data to address the scarcity of datasets for volumetric and temporally aware reasoning. The dataset comprises 170K QA pairs, split into 3D-RAD-Bench (~34K QA on 2,662 images) and 3D-RAD-T (~136K QA on 13,526 images), and spans six tasks covering anomaly detection, descriptive observation, quantitative computation, existence checks, and both static and longitudinal temporal reasoning. An LLM-assisted annotation pipeline with human validation delivers high-quality, expert-aligned QA, enabling rigorous benchmarking of 3D vision–language models under zero-shot and fine-tuning regimes. Experimental results show that current 3D VLMs generalize poorly to multi-temporal tasks in zero-shot settings, but domain-specific fine-tuning on 3D-RAD-T substantially boosts performance, particularly for the temporal tasks (Tasks 5 and 6), highlighting the dataset’s value in guiding future method development for accurate, temporally informed 3D medical VQA. Overall, 3D-RAD provides a scalable, extensible platform to advance 3D multimodal clinical reasoning and supports more reliable AI-assisted radiology workflows.

Abstract

Medical Visual Question Answering (Med-VQA) holds significant potential for clinical decision support, yet existing efforts primarily focus on 2D imaging with limited task diversity. This paper presents 3D-RAD, a large-scale dataset designed to advance 3D Med-VQA using radiology CT scans. The 3D-RAD dataset encompasses six diverse VQA tasks: anomaly detection, image observation, medical computation, existence detection, static temporal diagnosis, and longitudinal temporal diagnosis. It supports both open- and closed-ended questions while introducing complex reasoning challenges, including computational tasks and multi-stage temporal analysis, to enable comprehensive benchmarking. Extensive evaluations demonstrate that existing vision-language models (VLMs), especially medical VLMs, exhibit limited generalization, particularly in multi-temporal tasks, underscoring the challenges of real-world 3D diagnostic reasoning. To drive future advancements, we release a high-quality training set, 3D-RAD-T, of 136,195 expert-aligned samples, showing that fine-tuning on this dataset could significantly enhance model performance. Our dataset and code, which aim to catalyze multimodal medical AI research and establish a robust foundation for 3D medical visual understanding, are publicly available at https://github.com/Tang-xiaoxiao/3D-RAD.
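
To make the six-task structure concrete, the following is a minimal Python sketch of how the task taxonomy might be represented. The task numbering follows the order in which the abstract lists the tasks, and the answer format per task follows the Figure 3 caption (open-ended for Tasks 1 and 2, numeric for Task 3, closed-ended for Tasks 4 to 6). The `QARecord` class and all of its field names are hypothetical and are not taken from the released dataset or code; the actual file layout may differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class AnswerFormat(Enum):
    OPEN_ENDED = "open-ended"
    NUMERIC = "numeric"
    CLOSED_ENDED = "closed-ended"


# Task names and order come from the abstract; the answer format per task
# comes from the Figure 3 caption.
TASKS = {
    1: ("anomaly detection", AnswerFormat.OPEN_ENDED),
    2: ("image observation", AnswerFormat.OPEN_ENDED),
    3: ("medical computation", AnswerFormat.NUMERIC),
    4: ("existence detection", AnswerFormat.CLOSED_ENDED),
    5: ("static temporal diagnosis", AnswerFormat.CLOSED_ENDED),
    6: ("longitudinal temporal diagnosis", AnswerFormat.CLOSED_ENDED),
}


@dataclass(frozen=True)
class QARecord:
    """Hypothetical record layout; field names are illustrative assumptions."""

    task_id: int
    volume_path: str  # path to the CT volume (assumed field)
    question: str
    answer: str
    choices: Optional[Tuple[str, ...]] = None  # only for closed-ended tasks

    def __post_init__(self) -> None:
        if self.task_id not in TASKS:
            raise ValueError(f"unknown task id: {self.task_id}")
        if self.answer_format is AnswerFormat.CLOSED_ENDED:
            # Closed-ended questions need a choice list containing the answer.
            if not self.choices or self.answer not in self.choices:
                raise ValueError("closed-ended answer must be one of the choices")
        elif self.choices is not None:
            raise ValueError("choices are only valid for closed-ended tasks")

    @property
    def answer_format(self) -> AnswerFormat:
        return TASKS[self.task_id][1]

    @property
    def is_temporal(self) -> bool:
        # Tasks 5 and 6 are the temporal tasks named in the TL;DR.
        return self.task_id in (5, 6)
```

A record for Task 4 would carry a choice list, for example `QARecord(4, "scan.nii.gz", "<question>", "Yes", ("Yes", "No"))`, whereas Tasks 1 to 3 leave `choices` unset. Validating in `__post_init__` keeps malformed records from entering an evaluation loop silently.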

Paper Structure

This paper contains 36 sections, 35 figures, 8 tables.

Figures (35)

  • Figure 1: Qualitative Comparison Across Dataset Types. Prior work (e.g., VQA-RAD; top) focuses on 2D open- and closed-ended VQA, whereas our 3D-RAD (bottom) additionally includes 3D imaging and multi-temporal tasks.
  • Figure 2: Definitions of Open-Ended and Closed-Ended Tasks in 3D-RAD. Different colors indicate distinct tasks; items sharing the same color represent different subtasks within that task.
  • Figure 3: 3D-RAD Dataset Construction Pipeline. Left: meta dataset with a 3D scan, clinical report, and structured labels. Middle: QA construction—open-ended (Tasks 1–2) from selected report sentences with a five-dimension quality check; numeric (Task 3) from measurements; closed-ended (Tasks 4–6) via prompt templates and choice lists, with temporal indices for Task 6. Right: representative QA examples for Tasks 1–6.
  • Figure 4: A Concise Prompt Example for Tasks 1–2. See the figure labeled fig:task1-2 for the full prompt.
  • Figure 5: Data Distribution of 3D-RAD Dataset. Left: Organ distribution derived from named-entity recognition (NER) over all 3D-RAD QA pairs, showing coverage of major thoracic and upper-abdominal structures. Right: QA-based NER term statistics, shown as a bar chart of the most frequent entities/concepts alongside a word cloud that highlights the broader long-tail vocabulary of anatomical and pathological descriptors.
  • ...and 30 more figures