
Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models

Bei Yan, Jie Zhang, Zheng Yuan, Shiguang Shan, Xilin Chen

TL;DR

This work addresses the challenge of evaluating hallucination in Large Vision-Language Models (LVLMs) by proposing HQM, a psychometrics-inspired framework that separately assesses a benchmark's reliability (test-retest and parallel-forms) and validity (criterion validity and coverage of hallucination types). Guided by HQM, the authors construct HQH, a large open-ended hallucination benchmark built on Visual Genome with 4,000 image-instruction pairs across eight hallucination types, and adopt a binary, GPT-assisted evaluation to compute hallucination rates. Across nine open-source LVLMs and two strong closed-source models, HQH demonstrates superior reliability and competitive validity, while revealing persistent hallucination, particularly in existence, OCR, and complex relational tasks. The study argues for applying HQM to AI benchmarks more broadly and provides a publicly available benchmark resource to drive improvements in LVLM robustness and interpretability.

Abstract

Despite the rapid progress and outstanding performance of Large Vision-Language Models (LVLMs) in recent years, LVLMs have been plagued by the issue of hallucination, i.e., LVLMs tend to generate responses that are inconsistent with the corresponding visual inputs. To evaluate the degree of hallucination in LVLMs, previous works have proposed a series of benchmarks featuring different types of tasks and evaluation metrics. However, we find that the quality of the existing hallucination benchmarks varies, with some suffering from problems, e.g., inconsistent evaluation results under repeated tests, and misalignment with human evaluation. To this end, we propose a Hallucination benchmark Quality Measurement framework (HQM), which leverages various indicators to assess the reliability and validity of existing hallucination benchmarks separately. Specifically, for reliability we explore test-retest reliability and parallel-forms reliability, while for validity we examine criterion validity and coverage of hallucination types. Furthermore, based on the results of our quality measurement, we construct a High-Quality Hallucination Benchmark (HQH) for LVLMs, which demonstrates superior reliability and validity under our HQM framework. We conduct an extensive evaluation of over 10 representative LVLMs, including GPT-4o and Gemini-1.5-Pro, to provide an in-depth analysis of the hallucination issues in existing models. Our benchmark is publicly available at https://github.com/HQHBench/HQHBench.
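The abstract names the quality indicators but does not say how they are scored. As a rough illustration only, the sketch below assumes that a hallucination rate is the fraction of responses judged hallucinated, and that test-retest reliability can be summarized as the Pearson correlation between per-model scores from two runs of the same benchmark. The function names and all numbers are hypothetical and are not taken from the paper or its repository.

```python
from math import sqrt
from typing import Sequence


def hallucination_rate(judgments: Sequence[bool]) -> float:
    """Fraction of responses judged hallucinated (True = hallucinated)."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(judgments) / len(judgments)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up binary judgments for one model on one run (assumption).
    run = [True, False, False, True, False]
    print("hallucination rate:", hallucination_rate(run))

    # Made-up per-model hallucination rates from a test and a retest.
    test_scores = [0.42, 0.35, 0.51, 0.28]
    retest_scores = [0.40, 0.37, 0.49, 0.30]
    print("test-retest correlation:", pearson(test_scores, retest_scores))
```

Under the same assumption, parallel-forms reliability would compare scores obtained with equivalent rephrased prompts, and criterion validity would compare benchmark scores with human evaluation scores, using the same correlation function.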

Paper Structure

This paper contains 22 sections, 3 equations, 17 figures, 9 tables.

Figures (17)

  • Figure 1: Overview of our Hallucination benchmark Quality Measurement framework (HQM), assessing both reliability and validity. For reliability, we explore test-retest reliability and parallel-forms reliability, examining whether the evaluation results are consistent under repeated tests and parallel tests. For validity, we measure criterion validity and the coverage of hallucination types, focusing on whether the benchmark evaluation is aligned with human evaluation and comprehensive.
  • Figure 2: Leaderboards of mainstream open-source LVLMs on hallucination benchmarks.
  • Figure 3: Examples of image-instruction pairs for different hallucination types.
  • Figure 4: The prompt used in HQH evaluation.
  • Figure 5: Comparison of the hallucination rates $\downarrow$ of the top-8 LVLMs on different hallucination types. A smaller area indicates better performance.
  • ...and 12 more figures