H-POPE: Hierarchical Polling-based Probing Evaluation of Hallucinations in Large Vision-Language Models

Nhi Pham, Michael Schott

TL;DR

H-POPE is a coarse-to-fine-grained benchmark that systematically assesses hallucination in object existence and attributes in large vision-language models (LVLMs), and investigates whether these models rely on visual input to formulate their output text.

Abstract

By leveraging both text and images, large vision-language models (LVLMs) have shown significant progress in various multi-modal tasks. Nevertheless, these models often suffer from hallucinations, i.e., they exhibit inconsistencies between the visual input and the textual output. To address this, we propose H-POPE, a coarse-to-fine-grained benchmark that systematically assesses hallucination in object existence and attributes. Our evaluation shows that models are prone to hallucinations on object existence, and even more so on fine-grained attributes. We further investigate whether these models rely on visual input to formulate the output text.

Paper Structure

This paper contains 19 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Overview of our H-POPE benchmark.
  • Figure 2: Overview of our image-based adversarial sampling strategy. For a given object, e.g., the road sign, we randomly select an attribute from another object in the image, e.g., the attribute green from the tree. We then ask whether the original object has that attribute.
  • Figure 3: Difference in performance when asking our questions sequentially in a chat vs. asking them individually without context.
  • Figure 4: Pipeline for aggregating and visualizing the obtained relevance maps. We select the relevance maps for tokens that are either "Yes" or "No", or that name the object or attribute we asked about. The selected maps are averaged. Afterwards, in line with Stan et al. (2024), the averaged map is upscaled to match the image size and plotted as a heatmap on top of the image.
  • Figure 5: Examples of relevance maps for correct answers (in green) and incorrect answers (in red).
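
The sampling strategy in the Figure 2 caption and the aggregation steps in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The annotation format (a dict from object name to a set of attributes), the question template, the filter that skips attributes the target object already has, and the nearest-neighbor upscaling are all assumptions made for the example. The function names are hypothetical.

```python
import random


def sample_adversarial_question(annotations, target, rng=random):
    """Borrow an attribute from another object in the image and ask about `target`.

    `annotations` maps object name -> set of attributes (assumed format).
    Returns (question, expected_answer), or None if no other object
    offers an attribute that the target lacks.
    """
    candidates = sorted(
        (other, attr)
        for other, attrs in annotations.items()
        if other != target
        for attr in attrs
        # Assumption: skip attributes the target also has, so "no" is correct.
        if attr not in annotations[target]
    )
    if not candidates:
        return None
    _, attr = rng.choice(candidates)
    # Assumption: the question template is illustrative only.
    return f"Is the {target} {attr} in the image?", "no"


def aggregate_relevance_maps(token_maps, tokens, keywords):
    """Average the maps of tokens that are 'Yes', 'No', or one of `keywords`.

    `token_maps[i]` is a 2D list of relevance scores for `tokens[i]`.
    Returns the element-wise mean, or None if no token is selected.
    """
    keep = {"yes", "no"} | {k.lower() for k in keywords}
    selected = [m for t, m in zip(tokens, token_maps) if t.strip().lower() in keep]
    if not selected:
        return None
    rows, cols = len(selected[0]), len(selected[0][0])
    return [
        [sum(m[r][c] for m in selected) / len(selected) for c in range(cols)]
        for r in range(rows)
    ]


def upscale_nearest(grid, height, width):
    """Upscale a 2D grid to (height, width) by nearest-neighbor lookup.

    Assumption: the interpolation method is not stated in the caption.
    """
    rows, cols = len(grid), len(grid[0])
    return [
        [grid[y * rows // height][x * cols // width] for x in range(width)]
        for y in range(height)
    ]


if __name__ == "__main__":
    image = {"road sign": {"red"}, "tree": {"green", "tall"}}
    print(sample_adversarial_question(image, "road sign", random.Random(0)))
```

Passing a seeded `random.Random` instance keeps the sampled questions reproducible. The upscaled grid would then be drawn as a heatmap over the image with a plotting library of choice.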