VinDr-CXR-VQA: A Visual Question Answering Dataset for Explainable Chest X-Ray Analysis with Multi-Task Learning

Dang H. Nguyen; Hieu H. Pham; Hao T. Nguyen; Hieu H. Pham

TL;DR

VinDr-CXR-VQA introduces a large, clinically grounded Med-VQA dataset that ties question answering to spatial grounding and radiologist-provided reasoning for chest X-rays. The authors generate six question types and preserve expert bounding boxes, enabling end-to-end multi-task learning and explainable results. Fine-tuning a medical VLM (MedGemma-4B-it) with LoRA on this data raises $F1$ to $0.624$ and yields a mean IoU of $0.615$ for grounding, with localization reaching $\mathrm{IoU}\ge0.5$ in $22.8\%$ of cases and $\mathrm{IoU}\ge0.3$ in $48.6\%$ of cases. The work advances reproducible, interpretable Med-VQA and points to dense multi-lesion training and broader clinical validation as future directions.
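
To make the grounding metrics above concrete, the following is a minimal sketch of how box-level IoU and a thresholded hit rate (the share of cases with $\mathrm{IoU}\ge0.5$ or $\mathrm{IoU}\ge0.3$) are commonly computed. The `(x_min, y_min, x_max, y_max)` box format and the names `iou` and `hit_rate` are assumptions made for illustration; they are not taken from the authors' evaluation tools, which may match and aggregate boxes differently.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def hit_rate(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of (prediction, ground truth) pairs with IoU >= threshold."""
    scores = [iou(p, g) for p, g in zip(preds, gts)]
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical boxes, one prediction per ground-truth box.
    predictions = [(0, 0, 2, 2), (0, 0, 2, 2)]
    references = [(0, 0, 2, 2), (1, 1, 3, 3)]
    print(hit_rate(predictions, references, 0.5))
    print(hit_rate(predictions, references, 0.3))
```

The sketch pairs each prediction with exactly one reference box. Images with several lesions would need an explicit matching step, which is omitted here.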

Abstract

We present VinDr-CXR-VQA, a large-scale chest X-ray dataset for explainable Medical Visual Question Answering (Med-VQA) with spatial grounding. The dataset contains 17,597 question-answer pairs across 4,394 images, each annotated with radiologist-verified bounding boxes and clinical reasoning explanations. Our question taxonomy spans six diagnostic types (Where, What, Is there, How many, Which, and Yes/No), capturing diverse clinical intents. To improve reliability, we construct a balanced distribution of 41.7% positive and 58.3% negative samples, mitigating hallucinations in normal cases. Benchmarking with MedGemma-4B-it demonstrates improved performance (F1 = 0.624, +11.8% over baseline) while enabling lesion localization. VinDr-CXR-VQA aims to advance reproducible and clinically grounded Med-VQA research. The dataset and evaluation tools are publicly available at huggingface.co/datasets/Dangindev/VinDR-CXR-VQA.
