MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context

Zishan Gu, Changchang Yin, Fenglin Liu, Ping Zhang

TL;DR

MedVH introduces a dedicated benchmark for evaluating hallucinations in medical vision-language models, combining multimodal comprehension (MC-VQA) with long-form generation tasks (False Confidence Justification and Medical Report Generation). It demonstrates that domain-specific medical LVLMs, while strong on standard medical tasks, are more prone to hallucinations than general models, highlighting reliability concerns for clinical use. The framework includes a data-synthesis pipeline, defined metrics (acc_h, acc_b, char_score, CHAIR_I/CHAIR_S), and analyses of prompt design and temperature effects to guide robust model development. Overall, MedVH provides a comprehensive, cross-domain evaluation platform to drive development of trustworthy medical LVLMs and urges careful balancing of medical knowledge and reasoning during fine-tuning.
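The CHAIR_I/CHAIR_S metrics named above come from the image-captioning literature, where CHAIR_I is the fraction of mentioned objects that are hallucinated and CHAIR_S is the fraction of generated outputs containing at least one hallucinated object. The sketch below is a minimal illustration under that standard definition, applied to sets of findings per report; this summary does not spell out the paper's exact formulation, so the function name `chair_scores` and the use of finding sets are assumptions.

```python
from typing import Iterable, Set, Tuple


def chair_scores(
    reports: Iterable[Tuple[Set[str], Set[str]]],
) -> Tuple[float, float]:
    """Return (CHAIR_I, CHAIR_S) over (mentioned, ground_truth) finding sets.

    Hypothetical helper: assumes findings were already extracted from each
    generated report and from its reference annotation.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    total_reports = 0
    reports_with_hallucination = 0

    for mentioned, truth in reports:
        # A finding counts as hallucinated if it is absent from the reference.
        hallucinated = mentioned - truth
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        total_reports += 1
        reports_with_hallucination += 1 if hallucinated else 0

    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = reports_with_hallucination / total_reports if total_reports else 0.0
    return chair_i, chair_s


# Toy example with made-up findings (not data from the paper).
example = [
    ({"pneumonia", "effusion"}, {"effusion"}),
    ({"cardiomegaly"}, {"cardiomegaly"}),
]
chair_i, chair_s = chair_scores(example)
```

Lower values are better for both scores, since each one counts hallucinated content.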

Abstract

Large Vision Language Models (LVLMs) have recently achieved superior performance on a variety of tasks involving natural images and text, inspiring a large number of studies on LVLM fine-tuning and training. Despite these advances, there has been scant research on how robust these models are against hallucination when fine-tuned on smaller datasets. In this study, we introduce a new benchmark dataset, the Medical Visual Hallucination Test (MedVH), to evaluate hallucination in domain-specific LVLMs. MedVH comprises five tasks for evaluating hallucinations in LVLMs within the medical context, covering both comprehensive understanding of textual and visual input and long textual response generation. Our extensive experiments with both general and medical LVLMs reveal that, although medical LVLMs demonstrate promising performance on standard medical tasks, they are particularly susceptible to hallucinations, often more so than the general models, raising significant concerns about the reliability of these domain-specific models. For medical LVLMs to be truly valuable in real-world applications, they must not only accurately integrate medical knowledge but also maintain robust reasoning abilities to prevent hallucination. Our work paves the way for future evaluations of these studies.

Paper Structure (28 sections, 2 equations, 9 figures, 5 tables)

Figures (9)

  • Figure 1: Overall evaluation framework.
  • Figure 2: Detailed illustration of evaluation tasks in MedVH.
  • Figure 3: Results on MedVH dataset. (left) Accuracy of hallucination VQA tasks compared with accuracy of regular MC-VQA tasks. (right) Performance on characterization score considering the model size.
  • Figure 4: Variation in accuracy for different temperature values of Chat-GPT4V.
  • Figure 5: Variation in performance against hallucination for different wording of choices. Original means the ideal extra choice for the question, which should have been "This is not a suitable question for the image" for the Wrongful Image task and "The question contains a clinically incorrect premise" for the Clinically Incorrect Question task, respectively. NOTA indicates we substitute that choice with "None of the above".
  • ...and 4 more figures
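
The choice-wording comparison described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, assuming the extra option is appended after the regular answer choices; the names `EXTRA_CHOICE`, `NOTA`, and `build_choices` are hypothetical, and only the three option strings are taken from the caption.

```python
from typing import Dict, List

# Task-specific "ideal" extra choices, quoted from the Figure 5 caption.
EXTRA_CHOICE: Dict[str, str] = {
    "wrongful_image": "This is not a suitable question for the image",
    "clinically_incorrect_question": "The question contains a clinically incorrect premise",
}

# Generic substitute used in the NOTA setting.
NOTA = "None of the above"


def build_choices(base_choices: List[str], task: str, use_nota: bool = False) -> List[str]:
    """Return the answer options with the extra hallucination-aware choice.

    Assumption: the extra choice is placed last; the source does not state
    its position among the options.
    """
    extra = NOTA if use_nota else EXTRA_CHOICE[task]
    return [*base_choices, extra]


# Made-up base options for illustration only.
original = build_choices(["Left lung", "Right lung"], "wrongful_image")
nota = build_choices(["Left lung", "Right lung"], "wrongful_image", use_nota=True)
```

Comparing model accuracy on `original` versus `nota` style prompts mirrors the "Original" and "NOTA" conditions in the caption.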