Bridging the Semantic Gaps: Improving Medical VQA Consistency with LLM-Augmented Question Sets

Yongpei Ma, Pengyu Wang, Adam Dunn, Usman Naseem, Jinman Kim

TL;DR

This work tackles MVQA consistency under linguistic variation by introducing Semantically Equivalent Question Augmentation (SEQA), which uses large language models to generate paraphrases that preserve semantics. It introduces TAR-SC alongside three diversity metrics (ANQI, ANQA, ANQS) and demonstrates dataset augmentation on SLAKE, VQA-RAD, and PathVQA, leading to improved accuracy and consistency after fine-tuning across three MVQA models. The study shows that combining semantic invariance with factual correctness yields a more reliable evaluation framework for medical VQA, with reported gains of approximately 19.35% in accuracy and 11.61% in TAR-SC on augmented data. These findings highlight the practical benefits of paraphrase-aware data augmentation for clinical reasoning and provide a blueprint for evaluating semantic robustness in MVQA systems.

Abstract

Medical Visual Question Answering (MVQA) systems can interpret medical images in response to natural language queries. However, linguistic variability in question phrasing often undermines the consistency of these systems. To address this challenge, we propose a Semantically Equivalent Question Augmentation (SEQA) framework, which leverages large language models (LLMs) to generate diverse yet semantically equivalent rephrasings of questions. This approach enriches linguistic diversity while preserving semantic meaning. We further introduce an evaluation metric, Total Agreement Rate with Semantically Equivalent Input and Correct Answer (TAR-SC), which assesses a model's capability to generate consistent and correct responses to semantically equivalent linguistic variations. We also propose three diversity metrics: average number of QA items per image (ANQI), average number of questions per image with the same answer (ANQA), and average number of open-ended questions per image with the same semantics (ANQS). Using the SEQA framework, we augmented the public MVQA benchmark datasets SLAKE, VQA-RAD, and PathVQA. Incorporating more semantically equivalent questions improved all three datasets substantially: ANQI increased by an average of 86.1, ANQA by 85.1, and ANQS by 46. Subsequent experiments evaluate three MVQA models (M2I2, MUMC, and BiomedGPT) under both zero-shot and fine-tuning settings on the enhanced datasets. Experimental results on the MVQA datasets show that fine-tuned models achieve an average accuracy improvement of 19.35%, while our proposed TAR-SC metric shows an average improvement of 11.61%, indicating a substantial enhancement in model consistency.
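The abstract describes TAR-SC only in words, so the following is a minimal sketch of one plausible reading rather than the paper's formula: a group of semantically equivalent questions counts as a hit only when every prediction in the group matches the ground truth, and the score is the fraction of such groups. The function names, the dictionary layout, and the string normalization are assumptions made for illustration.

```python
from typing import Dict, List


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings compare equal."""
    return " ".join(answer.lower().split())


def tar_sc(groups: List[Dict]) -> float:
    """Fraction of groups whose predictions all agree AND match the ground truth.

    Assumed input layout (hypothetical): each group holds one ground-truth answer
    and the model's predictions for the original question plus its paraphrases.
    """
    if not groups:
        return 0.0
    hits = 0
    for group in groups:
        truth = normalize(group["ground_truth"])
        preds = [normalize(p) for p in group["predictions"]]
        # All predictions equal to the truth implies they are also mutually consistent.
        if preds and all(p == truth for p in preds):
            hits += 1
    return hits / len(groups)


example_groups = [
    {"ground_truth": "Lung", "predictions": ["lung", "Lung", "lung"]},       # consistent and correct
    {"ground_truth": "Yes", "predictions": ["no", "no", "no"]},              # consistent but wrong
    {"ground_truth": "Liver", "predictions": ["liver", "spleen", "liver"]},  # inconsistent
]
print(tar_sc(example_groups))
```

In this toy input only the first group is both consistent and correct. The second group shows why agreement alone is not enough: a model can give the same wrong answer to every paraphrase.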

Paper Structure

This paper contains 22 sections, 5 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of our proposed Semantically Equivalent Question Augmentation (SEQA) framework. The original question and image are used as input to an LLM, such as Gemini or GPT-4, to generate linguistically diversified questions. These augmented questions, along with the image, are then fed into a vision-language model (VLM). If the model provides the same answer for all the augmented questions, it is deemed consistent.
  • Figure 2: Examples of different modalities (columns) in MVQA. The first row is the modality type, followed by the question and the list of generated questions that share its semantic meaning but vary in syntax, structure, or phrasing.
  • Figure 3: Example of Semantically Equivalent Questions and Model Responses. "Original QA" refers to the original question and ground truth in the SLAKE dataset. "Variant Questions" are rephrased versions of the original question, designed to be semantically equivalent. The answers provided are generated by the MUMC model.
  • Figure 4: Distribution of Original Questions by Answer Consistency Levels. The x-axis represents answer consistency levels (the number of variant questions that received the same answer), while the y-axis shows the number of original questions corresponding to each level.
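
The distribution described for Figure 4 can be approximated with a short tally. This is a sketch under an assumption: the caption does not say which answer the variants are compared against, so here the consistency level is the size of the largest group of variant questions that share an answer. The function names and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def consistency_level(variant_answers: List[str]) -> int:
    """Size of the largest set of variant questions that received the same answer."""
    if not variant_answers:
        return 0
    counts = Counter(a.strip().lower() for a in variant_answers)
    return counts.most_common(1)[0][1]


def level_distribution(answers_by_question: Dict[str, List[str]]) -> Dict[int, int]:
    """Map each consistency level to the number of original questions at that level."""
    levels = Counter(consistency_level(v) for v in answers_by_question.values())
    return dict(sorted(levels.items()))


# Toy data: keys are original-question ids, values are the model's answers to the variants.
toy = {
    "q1": ["lung", "lung", "lung"],
    "q2": ["yes", "no", "yes"],
    "q3": ["liver", "Liver", "liver "],
}
print(level_distribution(toy))
```

Comparing against the answer to the original question, or against the ground truth, would be an equally reasonable variant and would only change `consistency_level`.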