THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models

Prannay Kaul, Zhizhong Li, Hao Yang, Yonatan Dukler, Ashwin Swaminathan, C. J. Taylor, Stefano Soatto

TL;DR

This paper shows that improvements in existing metrics do not lead to a reduction in Type I hallucinations and that established benchmarks for measuring Type I hallucinations are incomplete. It also provides a simple and effective data augmentation method that reduces both Type I and Type II hallucinations.

Abstract

Mitigating hallucinations in large vision-language models (LVLMs) remains an open problem. Recent benchmarks do not address hallucinations in open-ended free-form responses, which we term "Type I hallucinations". Instead, they focus on hallucinations in responses to very specific question formats -- typically a multiple-choice response regarding a particular object or attribute -- which we term "Type II hallucinations". Additionally, such benchmarks often require external API calls to models which are subject to change. In practice, we observe that a reduction in Type II hallucinations does not lead to a reduction in Type I hallucinations but rather that the two forms of hallucinations are often anti-correlated. To address this, we propose THRONE, a novel object-based automatic framework for quantitatively evaluating Type I hallucinations in LVLM free-form outputs. We use public language models (LMs) to identify hallucinations in LVLM responses and compute informative metrics. By evaluating a large selection of recent LVLMs using public datasets, we show that an improvement in existing metrics does not lead to a reduction in Type I hallucinations, and that established benchmarks for measuring Type I hallucinations are incomplete. Finally, we provide a simple and effective data augmentation method to reduce Type I and Type II hallucinations as a strong baseline. Code is now available at https://github.com/amazon-science/THRONE .

Paper Structure

This paper contains 1 section, 7 equations, 7 figures, and 8 tables.

Figures (7)

  • Figure 1: THRONE (Ours): LVLMs are prompted with a concept-neutral instruction. An external LM performs abstractive QA on the response to establish the existence of Type I hallucinations.
  • Figure 2: POPE: Questions with specific concepts prompt an LVLM directly to evaluate Type II hallucinations [li2023evaluating]. Hand-crafted rules parse LVLM responses to give yes/no labels.
  • Figure 3: Type I vs. Type II Hallucinations: (Top) LVLMs prompted with concept-neutral instructions produce Type I hallucinations. (Bottom) Instructions specifying a concept produce Type II hallucinations. Examples from LLaVA-v1.5 [liu2023improved].
  • Figure 4: A Comparison of POPE, CHAIR and THRONE: Directly querying LVLMs for object existence (person, banana, etc.) using concept-specific instructions, as in POPE (bottom left), does not produce the same hallucinations as using concept-neutral instructions (right). We highlight the Type I hallucinations in orange. CHAIR relies on exact text matching against a fixed set of objects and synonyms, and thus incorrectly labels "customers" and "shoppers" as hallucinations, highlighted in red. THRONE is designed for the rich vocabulary and the free-form generations of modern LVLMs by harnessing LMs to establish object existence. By using an LM to pass judgement, our evaluation correctly captures "customers" and "shoppers" as hypothetical content in the free-form generation.
  • Figure 5: AQA Ensembling in Evaluation: When running AQA on LVLM-generated responses, different LMs given an identical prompt, or an identical LM given different prompts, can produce opposing answers. To ensure THRONE is robust to this, we ensemble multiple LMs and multiple prompts in our evaluation pipeline.
  • ...and 2 more figures
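
The ensembling idea in the Figure 5 caption can be illustrated with a small sketch. The captions do not state how THRONE combines the answers, so the agreement-threshold rule below is an assumption made for illustration, not the paper's method. The function name `ensemble_vote`, the parameter `min_agreement`, and the LM and prompt identifiers are all hypothetical.

```python
from collections import Counter
from typing import Mapping, Optional, Tuple

# Key: (lm_name, prompt_id); value: that LM's yes/no answer to
# "does the response mention object X as present?".
# All names here are hypothetical placeholders.
Answers = Mapping[Tuple[str, str], str]


def ensemble_vote(answers: Answers, min_agreement: float = 1.0) -> Optional[str]:
    """Combine yes/no answers from several (LM, prompt) pairs.

    Returns "yes" or "no" when the most common answer reaches the
    required agreement fraction, and None (abstain) otherwise.
    ASSUMPTION: this threshold rule is illustrative only.
    """
    if not 0.5 < min_agreement <= 1.0:
        raise ValueError("min_agreement must be in (0.5, 1.0]")
    if not answers:
        return None
    # Normalise case and whitespace before counting votes.
    counts = Counter(a.strip().lower() for a in answers.values())
    label, votes = counts.most_common(1)[0]
    if votes / len(answers) >= min_agreement:
        return label
    return None


# Two hypothetical LMs, two hypothetical prompt templates.
votes = {
    ("lm_a", "prompt_1"): "yes",
    ("lm_a", "prompt_2"): "yes",
    ("lm_b", "prompt_1"): "yes",
    ("lm_b", "prompt_2"): "no",
}
print(ensemble_vote(votes))                      # strict: abstains
print(ensemble_vote(votes, min_agreement=0.75))  # relaxed: majority label
```

Requiring full agreement trades coverage for reliability: a single dissenting (LM, prompt) pair makes the ensemble abstain rather than commit to a possibly wrong label, while a lower threshold accepts a clear majority.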