Bridging the Semantic Gaps: Improving Medical VQA Consistency with LLM-Augmented Question Sets
Yongpei Ma, Pengyu Wang, Adam Dunn, Usman Naseem, Jinman Kim
TL;DR
This work tackles the consistency of medical visual question answering (MVQA) under linguistic variation by introducing Semantically Equivalent Question Augmentation (SEQA), which uses large language models to generate paraphrases that preserve semantics. It introduces the TAR-SC consistency metric alongside three diversity metrics (ANQI, ANQA, ANQS) and applies the augmentation to SLAKE, VQA-RAD, and PathVQA, improving both accuracy and consistency after fine-tuning across three MVQA models. The study shows that combining semantic invariance with factual correctness yields a more reliable evaluation framework for medical VQA, with reported average gains of 19.35% in accuracy and 11.61% in TAR-SC for fine-tuned models on the augmented data. These findings highlight the practical benefits of paraphrase-aware data augmentation for clinical reasoning and provide a blueprint for evaluating semantic robustness in MVQA systems.
Abstract
Medical Visual Question Answering (MVQA) systems can interpret medical images in response to natural language queries. However, linguistic variability in question phrasing often undermines the consistency of these systems. To address this challenge, we propose a Semantically Equivalent Question Augmentation (SEQA) framework, which leverages large language models (LLMs) to generate diverse yet semantically equivalent rephrasings of questions. This approach enriches linguistic diversity while preserving semantic meaning. We further introduce an evaluation metric, Total Agreement Rate with Semantically Equivalent Input and Correct Answer (TAR-SC), which assesses a model's ability to generate consistent and correct responses to semantically equivalent linguistic variations. We also propose three diversity metrics: the average number of QA items per image (ANQI), the average number of questions per image with the same answer (ANQA), and the average number of open-ended questions per image with the same semantics (ANQS). Using the SEQA framework, we augmented three public MVQA benchmark datasets: SLAKE, VQA-RAD, and PathVQA. After augmentation, all three datasets contained substantially more semantically equivalent questions: on average, ANQI increased by 86.1, ANQA by 85.1, and ANQS by 46. We then evaluate three MVQA models (M2I2, MUMC, and BiomedGPT) under both zero-shot and fine-tuning settings on the augmented datasets. The results show that fine-tuned models achieve an average accuracy improvement of 19.35%, while our proposed TAR-SC metric shows an average improvement of 11.61%, indicating a substantial enhancement in model consistency.
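
To make the idea behind TAR-SC concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that questions are grouped by a hypothetical `group_id` shared by all semantically equivalent rephrasings of one question about one image, that correctness is judged by exact match after simple normalization, and that a group counts as agreeing only when every rephrasing is answered correctly. The function names `normalize`, `tar_sc`, and `accuracy`, and the sample records, are invented for illustration.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.lower().split())


def accuracy(records) -> float:
    """Plain per-question accuracy over (group_id, predicted, gold) records."""
    records = list(records)
    if not records:
        return 0.0
    hits = sum(normalize(pred) == normalize(gold) for _, pred, gold in records)
    return hits / len(records)


def tar_sc(records) -> float:
    """Fraction of equivalence groups whose rephrasings are ALL answered correctly.

    Assumption: a group is counted only if every semantically equivalent
    question in it receives the correct answer (hence also the same answer).
    """
    groups = defaultdict(list)
    for group_id, pred, gold in records:
        groups[group_id].append(normalize(pred) == normalize(gold))
    if not groups:
        return 0.0
    agreeing = sum(1 for hits in groups.values() if all(hits))
    return agreeing / len(groups)


# Hypothetical predictions: (group_id, predicted answer, gold answer).
records = [
    ("img1-q1", "Yes", "yes"),
    ("img1-q1", "yes", "yes"),
    ("img1-q1", "no", "yes"),    # one rephrasing flips the answer
    ("img1-q2", "lung", "Lung"),
    ("img1-q2", "lung", "lung"),
]

print(f"accuracy = {accuracy(records):.2f}")
print(f"TAR-SC   = {tar_sc(records):.2f}")
```

Under these assumptions, a single wrong answer within a group removes the whole group from the TAR-SC count, so the score can be much lower than per-question accuracy when a model is sensitive to phrasing.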
