Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models

Chaoya Jiang, Hongrui Jia, Wei Ye, Mengfan Dong, Haiyang Xu, Ming Yan, Ji Zhang, Shikun Zhang

TL;DR

This work expands LVLM evaluation by introducing Event Hallucination as a distinct, more complex category and unifying discriminative and generative evaluation in a single framework (Hal-Eval). It leverages an automatic GPT-4–driven annotation pipeline (AFHA) to create Hal-Data and trains Hal-Evaluator for reference-free generative assessment, while also enabling discriminative testing with standardized prompts. Across six LVLMs, the study shows event hallucinations grow with output length and that combining discriminative and generative methods provides a fuller picture, with Chain-of-Thought prompting mitigating hallucinations in some cases. The Hal-Data–driven fine-tuning of LVLMs (Hal-VL) demonstrates improved robustness against hallucinations and gains in general benchmarks, highlighting practical pathways for deploying more reliable multimodal models.

Abstract

Large Vision Language Models (LVLMs) exhibit remarkable capabilities but struggle with hallucinations: inconsistencies between images and their descriptions. Previous hallucination evaluation studies on LVLMs have identified hallucinations in terms of objects, attributes, and relations but overlooked complex hallucinations that create an entire narrative around a fictional entity. In this paper, we introduce a refined taxonomy of hallucinations, featuring a new category: Event Hallucination. We then utilize advanced LLMs to generate and filter fine-grained hallucinatory data consisting of various types of hallucinations, with a particular focus on event hallucinations, laying the groundwork for integrating discriminative and generative evaluation methods within our universal evaluation framework. The proposed benchmark distinctively assesses LVLMs' ability to tackle a broad spectrum of hallucinations, making it a reliable and comprehensive tool for gauging LVLMs' efficacy in handling hallucinations. We will release our code and data.

Paper Structure (41 sections, 12 figures, 14 tables)

Figures (12)

  • Figure 1: Different types of hallucination. Event hallucination, which involves more complex vision-language discrepancy compared to other types of hallucination, is commonly overlooked by previous efforts.
  • Figure 2: The left sub-figure shows the ratios of various hallucination types in mPLUG-Owl's image descriptions shorter than 20 tokens. The right sub-figure shows the same ratios for descriptions longer than 20 tokens.
  • Figure 3: This figure provides a schematic of the discriminative evaluation and generative evaluation used in Hal-Eval.
  • Figure 4: Comparison of LLaVA 1.5 and LLaVA 1.5-COT. We report the F1 score for each.
  • Figure 5: The left sub-figure displays the results of the discriminative evaluation for GPT-4V and Hal-Evaluator. The right sub-figure compares the ROUGE-L between hallucination content detected by GPT-4V and Hal-Evaluator with the annotated hallucination content.
  • ...and 7 more figures
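
The ROUGE-L comparison mentioned in the Figure 5 caption can be illustrated with a minimal sketch. ROUGE-L scores a candidate text against a reference by the length of their longest common subsequence (LCS) of tokens. The function names, the lowercase whitespace tokenization, and the balanced F-measure (`beta=1.0`) below are assumptions made for illustration; the page does not state which ROUGE-L implementation or settings the authors used.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Dynamic programming, keeping only the previous row of the table.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-measure between two strings.

    Assumption: tokens are lowercase whitespace-separated words; real
    evaluations may use a different tokenizer or stemming.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens on the LCS
    recall = lcs / len(ref)      # share of reference tokens on the LCS
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


if __name__ == "__main__":
    # Hypothetical strings, not taken from the paper's data.
    detected = "a dog is running"
    annotated = "a dog runs"
    print(round(rouge_l(detected, annotated), 4))
```

In this setting, the candidate would be the hallucinated span reported by an evaluator (GPT-4V or Hal-Evaluator) and the reference would be the annotated hallucination content, so a higher score means the evaluator located the hallucinated text more precisely.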