RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering

Léo Butsanets, Charles Corbière, Julien Khlaut, Pierre Manceron, Corentin Dancette

TL;DR

RadImageNet-VQA introduces a large-scale CT/MRI-focused radiologic VQA dataset with 750K images and 7.5M QA samples, spanning anatomy recognition, abnormality detection, and fine-grained pathology identification across 8 anatomical regions and 97 pathologies. The authors construct a radiology-aware captioning and VQA pipeline, plus a benchmark of 1,000 images and 9,000 QA pairs designed to evaluate image-grounded reasoning and minimize linguistic shortcuts. Zero-shot results show that anatomy recognition is nearly solved while pathology identification remains a major bottleneck, and text-only analyses confirm reduced shortcut reliance on the new dataset. Fine-tuning across multiple vision-language models yields substantial gains, though medical pretraining of vision encoders offers limited advantages, underscoring the value of radiologic instruction tuning and dataset scale for improving radiology VLMs.

Abstract

In this work, we introduce RadImageNet-VQA, a large-scale dataset designed to advance radiologic visual question answering (VQA) on CT and MRI exams. Existing medical VQA datasets are limited in scale, dominated by X-ray imaging or biomedical illustrations, and often prone to text-based shortcuts. RadImageNet-VQA is built from expert-curated annotations and provides 750K images paired with 7.5M question-answer samples. It covers three key tasks - abnormality detection, anatomy recognition, and pathology identification - spanning eight anatomical regions and 97 pathology categories, and supports open-ended, closed-ended, and multiple-choice questions. Extensive experiments show that state-of-the-art vision-language models still struggle with fine-grained pathology identification, particularly in open-ended settings and even after fine-tuning. Text-only analysis further reveals that model performance collapses to near-random without image inputs, confirming that RadImageNet-VQA is free from linguistic shortcuts. The full dataset and benchmark are publicly available at https://huggingface.co/datasets/raidium/RadImageNet-VQA.
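The text-only analysis mentioned in the abstract can be illustrated with a minimal sketch: answer multiple-choice questions without the image and compare the resulting accuracy to the chance level. The `MCItem` type, the function names, and the `predict` callback are assumptions made for illustration; they are not part of the released dataset or the authors' evaluation code.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class MCItem:
    """Hypothetical multiple-choice item; field names are illustrative."""
    question: str
    options: List[str]
    answer_index: int


def chance_accuracy(items: Sequence[MCItem]) -> float:
    """Expected accuracy of uniform random guessing over the options."""
    return sum(1.0 / len(it.options) for it in items) / len(items)


def text_only_accuracy(
    items: Sequence[MCItem],
    predict: Callable[[str, List[str]], int],
) -> float:
    """Accuracy of a predictor that sees only the question and options."""
    correct = sum(predict(it.question, it.options) == it.answer_index for it in items)
    return correct / len(items)


def shortcut_gap(
    items: Sequence[MCItem],
    predict: Callable[[str, List[str]], int],
) -> float:
    """Text-only accuracy minus chance; a gap near zero suggests few textual shortcuts."""
    return text_only_accuracy(items, predict) - chance_accuracy(items)
```

In practice `predict` would wrap a vision-language model queried with the image withheld. A clearly positive gap would indicate that the question text or the options leak the answer.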

Paper Structure

This paper contains 28 sections, 13 figures, and 7 tables.

Figures (13)

  • Figure 1: Overview of the RadImageNet-VQA dataset, which provides radiology-focused supervision across three VQA tasks (anatomy recognition, fine-grained pathology identification, and abnormality detection) using diverse open-ended, closed-ended, and multiple-choice formats. It also includes radiologic captioning pairs for image–text alignment.
  • Figure 2: RadImageNet-VQA construction pipeline. Expert-annotated CT/MRI images are converted into radiology-aware captions and VQA samples using task- and format-specific templates. The pipeline generates open-ended, closed-ended, and multiple-choice questions across anatomy recognition, abnormality detection, and pathology identification, with distractors designed to prevent shortcuts.
  • Figure 3: Composition of the RadImageNet-VQA benchmark, containing 1,000 CT/MRI images and 9,000 QA pairs: (a) most frequent pathology labels, (b) distribution of anatomical regions, (c) question-type distribution.
  • Figure 4: Text-only analysis of multiple VLMs' accuracy for open-ended and MC questions on RadImageNet-VQA, VQA-RAD, SLAKE, and MMMU-Med-val.
  • Figure 5: Comparison of base and fine-tuned accuracies on RadImageNet-VQA.
  • ...and 8 more figures
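
The template-based construction summarized in the Figure 2 caption can be sketched roughly as follows. This is only an illustration under stated assumptions: the label vocabulary, the question template, and the choice to draw distractors from the same anatomical region are hypothetical and are not taken from the authors' pipeline.

```python
import random
from typing import Dict, List

# Hypothetical label vocabulary, for illustration only.
PATHOLOGIES_BY_REGION: Dict[str, List[str]] = {
    "knee": ["meniscal tear", "ACL injury", "joint effusion", "bone contusion", "chondral lesion"],
    "brain": ["acute infarct", "white matter changes", "edema", "hemorrhage", "pituitary lesion"],
}


def make_mc_question(record: Dict[str, str], rng: random.Random, n_options: int = 4) -> Dict:
    """Build one multiple-choice pathology question from an annotated record.

    Assumption: distractors come from the same region as the correct label,
    so the region named in the question does not reveal the answer.
    """
    region = record["anatomy"]
    correct = record["pathology"]
    candidates = [p for p in PATHOLOGIES_BY_REGION[region] if p != correct]
    distractors = rng.sample(candidates, n_options - 1)
    options = distractors + [correct]
    rng.shuffle(options)  # randomize the position of the correct answer
    return {
        "question": f"Which finding is visible in this {record['modality']} image of the {region}?",
        "options": options,
        "answer_index": options.index(correct),
    }


example = make_mc_question(
    {"modality": "MRI", "anatomy": "knee", "pathology": "meniscal tear"},
    random.Random(0),
)
```

Passing a seeded `random.Random` keeps the generated options reproducible, which makes it easier to audit a benchmark split for leaked answers.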