DiN: Diffusion Model for Robust Medical VQA with Semantic Noisy Labels

Erjian Guo, Zhen Zhao, Zicheng Wang, Tong Chen, Yunyi Liu, Luping Zhou

TL;DR

This work tackles Med-VQA under noisy labels by introducing DiN, a diffusion-based framework that treats answer prediction as a coarse-to-fine generative process. It integrates an Answer Condition Generator to condition the diffusion process, a Noisy Label Refinement module with a robust loss and pseudo-label strategy, and an Answer Diffuser that performs diffusion-based answer classification, trained end-to-end with a combined loss. The approach is validated on VQA-RAD and PathVQA under semantic and random noise, showing superior robustness and accuracy compared with strong baselines. The proposed framework advances robust, medically reliable VQA under label noise, with practical implications for clinical interpretation and education tools where annotation quality can be variable.

Abstract

Medical Visual Question Answering (Med-VQA) systems benefit the interpretation of medical images containing critical clinical information. However, the challenge of noisy labels and limited high-quality datasets remains underexplored. To address this, we establish the first benchmark for noisy labels in Med-VQA by simulating human mislabeling with semantically designed noise types. More importantly, we introduce the DiN framework, which leverages a diffusion model to handle noisy labels in Med-VQA. Unlike the dominant classification-based VQA approaches that directly predict answers, our Answer Diffuser (AD) module employs a coarse-to-fine process, refining answer candidates with a diffusion model for improved accuracy. The Answer Condition Generator (ACG) further enhances this process by generating task-specific conditional information via integrating answer embeddings with fused image-question features. To address label noise, our Noisy Label Refinement (NLR) module introduces a robust loss function and dynamic answer adjustment to further boost the performance of the AD module.

Paper Structure

This paper contains 22 sections, 9 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Comparison between our diffusion-based method and previous Med-VQA approaches. Previous Med-VQA methods, designed for clean label datasets, include classification-based (a) and generation-based (b) approaches. In contrast, our diffusion-based method (c) classifies answers from a generative perspective and denoises answers by Answer Diffuser with the support of Answer Condition Generator and the Noisy Label Refinement module (employed only during the training process).
  • Figure 2: Visualization of the semantic answer pairs in t-SNE and examples of noisy answers. Similar colors represent the degree of feature similarity from a pretrained BERT model. We randomly display 8 semantic pairs in the t-SNE plots of the VQA-RAD dataset. Ground-truth labels are replaced with their semantic pairs, which we refer to as semantically noisy answers [zhang2023learning].
  • Figure 3: The proposed DiN framework consists of three key modules: 1) Answer Condition Generator (ACG): This module combines the multi-modal image and question features with image-question pair knowledge from the Answer Condition Embedding (ACE) to obtain Med-VQA condition information. 2) Noisy Label Refinement (NLR): This module contains a robust loss function, $\mathcal{L}_{RFL}$, which supervises the proto-answer to mitigate the impact of noisy original answers on the two encoders' acquisition of medical domain knowledge. The answer adjustment of NLR uses proto-answers and the original noisy answers to generate pseudo-labels that supervise the AD module. 3) Answer Diffuser (AD): This module refines the noisy answer distribution, simulating a generation process to select the correct answers. Notably, only the AD module is used to predict answers during inference; the NLR module is not.
  • Figure 4: Visualization of example results from our DiN framework on the VQA-RAD and PathVQA datasets. The first row presents examples from the training set with 10%-Semantic Noise, where our method corrects the noisy incorrect answers during the training process. The second row shows two examples from the test sets of both datasets.
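
The semantic noise described in the Figure 2 caption (ground-truth answers replaced by their semantic pairs) can be illustrated with a minimal sketch. This is not the authors' procedure: the pairing rule (nearest neighbour by cosine similarity), the function names `semantic_pairs` and `inject_semantic_noise`, and the random toy vectors standing in for BERT answer features are all assumptions made for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_pairs(embeddings):
    """For each answer index, pick the most similar *other* answer (assumed pairing rule)."""
    pairs = []
    for i, u in enumerate(embeddings):
        candidates = [j for j in range(len(embeddings)) if j != i]
        pairs.append(max(candidates, key=lambda j: cosine(u, embeddings[j])))
    return pairs


def inject_semantic_noise(labels, pairs, noise_rate, rng):
    """Replace a fraction `noise_rate` of labels with their semantic pair."""
    n_noisy = round(noise_rate * len(labels))
    flip_idx = sorted(rng.sample(range(len(labels)), n_noisy))
    noisy = list(labels)
    for i in flip_idx:
        noisy[i] = pairs[labels[i]]
    return noisy, flip_idx


rng = random.Random(0)
# Toy stand-in for pretrained answer features (hypothetical: 6 answers, 16 dims).
answer_emb = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(6)]
pairs = semantic_pairs(answer_emb)

clean = [rng.randrange(6) for _ in range(100)]  # toy clean answer labels
noisy, flipped = inject_semantic_noise(clean, pairs, 0.10, rng)
print(f"{len(flipped)} of {len(clean)} labels replaced by their semantic pair")
```

Because an answer is never paired with itself, every selected label actually changes, so the realised noise rate matches the requested one. A random-noise variant would draw the replacement uniformly from the other answers instead of using `pairs`.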