DiN: Diffusion Model for Robust Medical VQA with Semantic Noisy Labels
Erjian Guo, Zhen Zhao, Zicheng Wang, Tong Chen, Yunyi Liu, Luping Zhou
TL;DR
This work tackles Med-VQA under noisy labels by introducing DiN, a diffusion-based framework that treats answer prediction as a coarse-to-fine generative process. It integrates an Answer Condition Generator to condition the diffusion process, a Noisy Label Refinement module with a robust loss and pseudo-label strategy, and an Answer Diffuser that performs diffusion-based answer classification, trained end-to-end with a combined loss. The approach is validated on VQA-RAD and PathVQA under semantic and random noise, showing superior robustness and accuracy compared with strong baselines. The proposed framework advances robust, medically reliable VQA under label noise, with practical implications for clinical interpretation and education tools where annotation quality can be variable.
Abstract
Medical Visual Question Answering (Med-VQA) systems aid the interpretation of medical images that contain critical clinical information. However, the challenges of noisy labels and limited high-quality datasets remain underexplored. To address this, we establish the first benchmark for noisy labels in Med-VQA by simulating human mislabeling with semantically designed noise types. More importantly, we introduce the DiN framework, which leverages a diffusion model to handle noisy labels in Med-VQA. Unlike the dominant classification-based VQA approaches that predict answers directly, our Answer Diffuser (AD) module employs a coarse-to-fine process, refining answer candidates with a diffusion model for improved accuracy. The Answer Condition Generator (ACG) further enhances this process by generating task-specific conditional information, integrating answer embeddings with fused image-question features. To address label noise, our Noisy Label Refinement (NLR) module introduces a robust loss function and dynamic answer adjustment to further boost the performance of the AD module.
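To make the idea of simulated mislabeling concrete, the following is a minimal Python sketch of how semantic and random label noise could be injected into a set of answers. It is an illustration under stated assumptions, not the benchmark's actual procedure: the function name `inject_label_noise`, the `confusable` mapping of semantically close answers, and the example vocabulary are all hypothetical, and the abstract does not specify how the noise types are defined.

```python
import random


def inject_label_noise(labels, noise_rate, answer_vocab, confusable=None, seed=0):
    """Return a copy of `labels` with roughly `noise_rate` of entries corrupted.

    `confusable` is a hypothetical dict mapping an answer to a list of
    semantically close answers. If the selected label has such alternatives,
    the replacement is drawn from them ("semantic" noise); otherwise it is
    drawn uniformly from the other answers in `answer_vocab` ("random" noise).
    """
    rng = random.Random(seed)
    confusable = confusable or {}
    noisy = list(labels)
    n_flip = round(noise_rate * len(labels))
    # Pick distinct positions to corrupt.
    for i in rng.sample(range(len(labels)), n_flip):
        pool = [a for a in confusable.get(labels[i], []) if a != labels[i]]
        if not pool:
            pool = [a for a in answer_vocab if a != labels[i]]
        if pool:  # skip if there is no alternative answer at all
            noisy[i] = rng.choice(pool)
    return noisy


# Example with an assumed toy vocabulary.
vocab = ["yes", "no", "left", "right"]
clean = vocab * 25
noisy = inject_label_noise(
    clean,
    noise_rate=0.2,
    answer_vocab=vocab,
    confusable={"left": ["right"], "right": ["left"]},
)
```

In this sketch, a corrupted "left" always becomes "right" (and vice versa), while "yes" and "no" fall back to a uniformly random wrong answer. The split between the two noise types is a design choice of the example only.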
