Hallucination Benchmark in Medical Visual Question Answering
Jinge Wu, Yunsoo Kim, Honghan Wu
TL;DR
This work tackles hallucination in Medical Visual Question Answering (Med-VQA) by introducing a dedicated hallucination benchmark derived from three public VQA datasets, with stress scenarios such as FAKE questions, None of the Above (NOTA), and Image SWAP. It evaluates a range of vision-language models, including LLaVA variants and GPT-4-turbo-vision, under a prompting ablation, and finds that LLaVA-v1.5-13B shows strong robustness while NOTA remains challenging. Prompting strategy, particularly the L + D0 configuration, significantly influences hallucination detection, and domain-specific fine-tuning does not universally improve robustness. Overall, the benchmark provides a baseline, practical insights, and ready-to-use code for assessing and mitigating hallucination risks in medical visual assistants.
Abstract
The recent success of large language and vision models (LLVMs) on visual question answering (VQA), particularly in medical applications (Med-VQA), has shown great potential for realizing effective visual assistants for healthcare. However, these models have not been extensively tested for hallucination in clinical settings. Here, we create a hallucination benchmark of medical images paired with question-answer sets and conduct a comprehensive evaluation of state-of-the-art models. The study provides an in-depth analysis of current models' limitations and reveals the effectiveness of various prompting strategies.
