BioD2C: A Dual-level Semantic Consistency Constraint Framework for Biomedical VQA

Zhengyang Ji, Shang Gao, Li Liu, Yifan Jia, Yutao Yue

TL;DR

BioD2C addresses image-text misalignment in biomedical VQA by enforcing semantic consistency at two levels: a feature-level image-text fusion that conditions visual features on the question, and a text-queue-based cross-modal loss that aligns the fused visual semantics with the question semantics. The method combines a multi-scale image feature extractor, a Transformer-based text-conditioned fusion module, and a KL-divergence-based semantic loss, and is trained in two stages on the BioVGQ dataset, which provides cleaner, context-rich image-question pairs. Empirical results show state-of-the-art performance on the SLAKE, Path-VQA, and RAD-VQA benchmarks, and ablations highlight the importance of the semantic loss, the fusion mechanism, and dataset quality. The approach could support clinical decision-making by enabling more accurate, region-focused visual reasoning in biomedical contexts.

Abstract

Biomedical visual question answering (VQA) has been widely studied and has demonstrated significant application value and potential in fields such as assistive medical diagnosis. Despite their success, current biomedical VQA models perform multimodal information interaction only at the model level within large language models (LLMs), leading to suboptimal multimodal semantic alignment when dealing with complex tasks. To address this issue, we propose BioD2C: a novel Dual-level Semantic Consistency Constraint Framework for Biomedical VQA, which achieves dual-level semantic interaction alignment at both the model and feature levels, enabling the model to adaptively learn visual features based on the question. Specifically, we first integrate textual features into visual features via an image-text fusion mechanism as feature-level semantic interaction, obtaining visual features conditioned on the given text; we then introduce a text-queue-based cross-modal soft semantic loss function to further align the image semantics with the question semantics. In addition, we establish a new dataset, BioVGQ, to address inherent biases in prior datasets by filtering manually altered images and aligning question-answer pairs with multimodal context, and train our model on this dataset. Extensive experimental results demonstrate that BioD2C achieves state-of-the-art (SOTA) performance across multiple downstream datasets, showcasing its robustness, generalizability, and potential to advance biomedical VQA research.
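
To make the text-queue-based soft semantic loss more concrete, here is a minimal NumPy sketch of one way such a loss could be computed. It is not the authors' implementation. The pooled per-sample feature vectors, the cosine similarity, the temperature `tau`, the direction of the KL divergence (text distribution as the target), and the FIFO queue update are all assumptions made for illustration, as are the function names.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def text_queue_semantic_loss(x_v_given_t, x_t, text_queue, tau=0.1):
    """Mean KL(p_text || p_image) over similarities to queued text features.

    x_v_given_t: (batch, dim) pooled text-conditioned visual features (assumed shape)
    x_t:         (batch, dim) pooled question features (assumed shape)
    text_queue:  (queue_len, dim) text features kept from earlier batches
    """
    v = l2_normalize(x_v_given_t)
    t = l2_normalize(x_t)
    q = l2_normalize(text_queue)
    # Each sample gets a soft distribution over the queue entries.
    log_p_img = log_softmax(v @ q.T / tau)
    log_p_txt = log_softmax(t @ q.T / tau)
    # The text distribution acts as the soft target for the image distribution.
    kl = np.sum(np.exp(log_p_txt) * (log_p_txt - log_p_img), axis=-1)
    return float(kl.mean())

def update_queue(text_queue, x_t):
    # FIFO update (an assumption): append the new text features, drop the oldest.
    return np.concatenate([text_queue, x_t], axis=0)[x_t.shape[0]:]

rng = np.random.default_rng(0)
queue = rng.normal(size=(16, 8))        # 16 queued text features of dimension 8
x_t = rng.normal(size=(4, 8))           # question features for a batch of 4
x_v_given_t = rng.normal(size=(4, 8))   # text-conditioned visual features
loss = text_queue_semantic_loss(x_v_given_t, x_t, queue)
queue = update_queue(queue, x_t)
```

In this sketch the loss is zero exactly when the visual and question features induce the same similarity distribution over the queue, which is one reading of "soft" alignment: the two modalities are asked to agree on how they relate to other questions rather than to match each other directly.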

Paper Structure

This paper contains 13 sections, 5 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: (a) and (b) illustrate the performance of the model-level interaction framework and BioD2C under image-related questions, respectively. Red text represents incorrect answers, while green text represents correct answers.
  • Figure 2: BioD2C Architecture. Feature-level Interaction: Medical images and text questions are encoded into features $X_v$ and $X_t$. A multi-scale enhanced $X_v$ is fused with $X_t$ via a Transformer decoder, generating $X_{vt}$, which is then combined with $X_v$ through a gating mechanism to produce text-conditioned features $X_{v|t}$. Semantic Loss: A text-queue loss guides $X_{v|t}$ to align with $X_t$.
  • Figure 3: Visualization of the attention map of the input image.
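
The feature-level interaction described in the Figure 2 caption can be sketched roughly as follows. This is an illustrative NumPy example under stated assumptions, not the paper's code: a single-head cross-attention without learned projections stands in for the Transformer decoder, visual tokens are assumed to act as queries over the text tokens, and the gate is assumed to be a sigmoid over the concatenation of $X_v$ and $X_{vt}$ with hypothetical parameters `w_gate` and `b_gate`.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(x_v, x_t):
    # Stand-in for the Transformer decoder: visual tokens are queries,
    # text tokens are keys and values (learned projections omitted).
    d = x_v.shape[-1]
    attn = softmax(x_v @ x_t.T / np.sqrt(d))   # (num_visual, num_text)
    return attn @ x_t                          # X_vt, shape (num_visual, d)

def gated_fusion(x_v, x_vt, w_gate, b_gate):
    # Assumed gate form: an elementwise sigmoid gate computed from [X_v ; X_vt].
    logits = np.concatenate([x_v, x_vt], axis=-1) @ w_gate + b_gate
    gate = 1.0 / (1.0 + np.exp(-logits))
    # The gate blends text-fused features with the original visual features.
    return gate * x_vt + (1.0 - gate) * x_v

rng = np.random.default_rng(0)
x_v = rng.normal(size=(6, 8))             # 6 visual tokens, dimension 8
x_t = rng.normal(size=(3, 8))             # 3 text tokens, dimension 8
x_vt = cross_attention(x_v, x_t)
w_gate = rng.normal(size=(16, 8)) * 0.1   # hypothetical gate weights
b_gate = np.zeros(8)                      # hypothetical gate bias
x_v_given_t = gated_fusion(x_v, x_vt, w_gate, b_gate)
```

With this form of gate, a strongly negative bias recovers the original $X_v$ and a strongly positive bias passes $X_{vt}$ through unchanged, so the model can decide per feature how much question information to inject.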