VinDr-CXR-VQA: A Visual Question Answering Dataset for Explainable Chest X-Ray Analysis with Multi-Task Learning

Dang H. Nguyen, Hieu H. Pham, Hao T. Nguyen, Hieu H. Pham

TL;DR

VinDr-CXR-VQA introduces a large, clinically grounded Med-VQA dataset that ties question answering to spatial grounding and radiologist-provided reasoning for chest X-rays. The authors generate six question types and preserve expert bounding boxes, enabling end-to-end multi-task learning and explainable results. Fine-tuning a medical VLM with LoRA on this data raises $F1$ to $0.624$ and reaches a mean IoU of $0.615$ for grounding, with notable localization gains ($\mathrm{IoU} \ge 0.5$ in $22.8\%$ and $\mathrm{IoU} \ge 0.3$ in $48.6\%$ of cases). This work advances reproducible, interpretable Med-VQA and highlights future directions for dense multi-lesion training and broader clinical validation.

Abstract

We present VinDr-CXR-VQA, a large-scale chest X-ray dataset for explainable Medical Visual Question Answering (Med-VQA) with spatial grounding. The dataset contains 17,597 question-answer pairs across 4,394 images, each annotated with radiologist-verified bounding boxes and clinical reasoning explanations. Our question taxonomy spans six diagnostic types (Where, What, Is there, How many, Which, and Yes/No), capturing diverse clinical intents. To improve reliability, we construct a balanced distribution of 41.7% positive and 58.3% negative samples, mitigating hallucinations in normal cases. Benchmarking with MedGemma-4B-it demonstrates improved performance (F1 = 0.624, +11.8% over baseline) while enabling lesion localization. VinDr-CXR-VQA aims to advance reproducible and clinically grounded Med-VQA research. The dataset and evaluation tools are publicly available at huggingface.co/datasets/Dangindev/VinDR-CXR-VQA.

Paper Structure

This paper contains 13 sections, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Overview of the VinDr-CXR-VQA pipeline. The figure illustrates the multi-task learning approach, encompassing dataset creation (unifying Question & Answer, Bounding Box, and Reasoning) and the downstream fine-tuning of MedGemma-4B-it, enabling explainable VQA with accurate spatial grounding.
  • Figure 2: Model comparison on VinDr-CXR validation images. Bounding boxes: ground truth (green), predictions from pretrained MedGemma (yellow), and predictions from fine-tuned MedGemma (red). (A) Left image: infiltration with high IoU (0.715). The fine-tuned model achieves excellent localization, significantly outperforming the pretrained baseline. (B) Right image: consolidation with moderate IoU (0.482). The fine-tuned model demonstrates improved accuracy over the pretrained baseline. Both cases show the fine-tuned model's superior spatial localization while maintaining clinical reasoning capability.
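
The IoU values quoted in Figure 2 and the threshold-based localization rates in the TL;DR can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the `(x_min, y_min, x_max, y_max)` box format and the names `iou` and `hit_rate` are assumptions made here for illustration.

```python
from typing import Sequence, Tuple

# Assumed box format (not confirmed by the paper): (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Overlap along each axis, clamped at zero when the boxes are disjoint.
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_w * overlap_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def hit_rate(ious: Sequence[float], threshold: float) -> float:
    """Fraction of cases whose IoU meets or exceeds the threshold."""
    if not ious:
        return 0.0
    return sum(value >= threshold for value in ious) / len(ious)


if __name__ == "__main__":
    # The two IoU values reported in the Figure 2 caption.
    figure_ious = [0.715, 0.482]
    print(hit_rate(figure_ious, 0.5))  # only case (A) clears IoU >= 0.5
    print(hit_rate(figure_ious, 0.3))  # both cases clear IoU >= 0.3
```

Under this definition, case (A) in Figure 2 counts as a hit at both thresholds, while case (B), with IoU 0.482, counts only at the looser $\mathrm{IoU} \ge 0.3$ threshold.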