BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models

Moon Ye-Bin, Nam Hyeon-Woo, Wonseok Choi, Tae-Hyun Oh

TL;DR

BEAF introduces a change-aware evaluation framework for vision-language models that jointly manipulates visual scenes and text prompts to diagnose hallucination. It defines four metrics (True Understanding, Ignorance, Stubbornness, and Indecision) that capture how model answers shift when an image is edited, alongside an F1 score, a harmonic mean that combines TU and ID. The dataset comprises 26K image-question pairs built from 500 original MS-COCO images and 1,727 manipulated variants created through object removal, enabling fine-grained analysis of scene understanding and object interactions. Experiments with zero-shot VLMs (e.g., LLaVA, InstructBLIP, Shikra, mPLUG-Owl) show that high traditional accuracy often coexists with hallucinations that accuracy alone does not expose, and the two-axis visualizations uncover inter-object dependencies that influence model outputs. BEAF thus provides a practical, change-aware benchmark that reveals nuanced failure modes and can guide model improvement, while its limitations are tied to dataset diversity and the automated manipulation pipeline.

Abstract

Vision-language models (VLMs) perceive the world through a combination of a visual encoder and a large language model (LLM). The visual encoder, pre-trained on large-scale vision-text datasets, provides zero-shot generalization to visual data, and the LLM endows VLMs with its strong reasoning ability. This combination lets VLMs achieve high performance on a wide range of benchmarks without fine-tuning, exhibiting zero- or few-shot capability. However, recent studies show that VLMs are vulnerable to hallucination. This undesirable behavior degrades reliability and credibility, thereby making users unable to fully trust the output from VLMs. To enhance trustworthiness and better tackle the hallucination of VLMs, we curate a new evaluation dataset, called the BEfore-AFter hallucination dataset (BEAF), and introduce new metrics: True Understanding (TU), IGnorance (IG), StuBbornness (SB), and InDecision (ID). Unlike prior works that focus only on constructing questions and answers, the key idea of our benchmark is to manipulate visual scene information by image editing models and to design the metrics based on scene changes. This allows us to clearly assess whether VLMs correctly understand a given scene by observing the ability to perceive changes. We also visualize image-wise object relationships by virtue of our two-axis view: vision and text. Upon evaluating VLMs with our dataset, we observed that our metrics reveal different aspects of VLM hallucination that have not been reported before. Project page: https://beafbench.github.io/
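To make the idea of scoring answer changes concrete, the following is a minimal sketch of how the four change-aware metrics and the combined F1 score could be computed from yes/no answers given before and after an object is removed. The exact definitions are not stated on this page, so the rules in the code are assumptions: TU counts a removed object answered "yes" before and "no" after, IG counts the reverse, SB counts an unchanged answer for a removed object, ID counts a changed answer for an object that was not removed, and the F1 score is taken as the harmonic mean of TU and (100 - ID). The names `AnswerPair` and `change_aware_metrics` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnswerPair:
    """One question asked on the original and on the manipulated image."""
    removed: bool  # True if the queried object is the one removed from the image
    before: str    # model answer on the original image: "yes" or "no"
    after: str     # model answer on the manipulated image: "yes" or "no"


def change_aware_metrics(pairs):
    """Return TU, IG, SB, ID, and F1 as percentages (assumed definitions)."""
    removed = [p for p in pairs if p.removed]
    kept = [p for p in pairs if not p.removed]

    def pct(count, total):
        return 100.0 * count / total if total else 0.0

    # Removed object: correct is "yes" before and "no" after.
    tu = pct(sum(p.before == "yes" and p.after == "no" for p in removed), len(removed))
    # Assumed IG: wrong both times ("no" before, "yes" after).
    ig = pct(sum(p.before == "no" and p.after == "yes" for p in removed), len(removed))
    # Assumed SB: the answer does not change although the object was removed.
    sb = pct(sum(p.before == p.after for p in removed), len(removed))
    # Assumed ID: the answer changes although the queried object was untouched.
    indecision = pct(sum(p.before != p.after for p in kept), len(kept))

    # Assumed F1: harmonic mean of TU and (100 - ID), so higher is better.
    stable = 100.0 - indecision
    f1 = 2 * tu * stable / (tu + stable) if (tu + stable) else 0.0
    return {"TU": tu, "IG": ig, "SB": sb, "ID": indecision, "F1": f1}


example_pairs = [
    AnswerPair(removed=True, before="yes", after="no"),    # true understanding
    AnswerPair(removed=True, before="yes", after="yes"),   # stubborn
    AnswerPair(removed=True, before="no", after="no"),     # stubborn
    AnswerPair(removed=True, before="no", after="yes"),    # ignorant
    AnswerPair(removed=False, before="yes", after="yes"),  # stable
    AnswerPair(removed=False, before="yes", after="no"),   # indecisive
]
example_scores = change_aware_metrics(example_pairs)
```

Under these assumed rules, every question about a removed object falls into exactly one of TU, IG, or SB, so the three percentages sum to 100, while ID is measured on a separate set of questions.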

Paper Structure

This paper contains 25 sections, 5 equations, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: Illustration of BEfore-AFter (BEAF) benchmark. We present a comparison between traditional evaluation benchmarks and our BEAF benchmark for assessing the hallucination behavior in VLMs. Traditional evaluation methods solely manipulate questions based on the existence of an object and measure accuracy or F1 score. In contrast, our BEAF benchmark not only constructs questions but also manipulates images and tracks changes in answers as the images undergo manipulation. The BEAF benchmark introduces novel metrics including True Understanding (TU), IGnorance (IG), StuBbornness (SB), and InDecision (ID), which consider the changes in answers.
  • Figure 2: Ratio of removed objects from image and object in question. For convenience, we report the ratio of upper-class categories instead of the object class itself. The total number of removed objects equals the number of manipulated images, and the total number of objects in question equals the number of image-question pairs.
  • Figure 3: Image manipulation pipeline. We illustrate the image manipulation pipeline, which comprises three stages. In Stage 1, we automatically remove target objects sharing the same semantic class from the given images. In Stage 2, we filter the automatically manipulated results based on predefined rules that flag mask errors, remaining shadows, and low-quality outcomes. Undesirable manipulations are either corrected or discarded during this stage. Finally, in Stage 3, human annotators perform human-guided manipulation of the filtered images to achieve high-quality results.
  • Figure 4: Visualization of image-wise object relationships. We visualize image-wise object relationships along the text and vision axes for Shikra. [Top] Original image samples. [Bottom] The object relation table given the manipulated images and questions. We color correct answers blue and wrong answers red. The text axis stands for the target object queried in the question, and the vision axis for the object removed from the image. The "none" entry on the vision axis denotes the original (unmanipulated) image. This lets us analyze the influence between objects within a scene at once.
  • Figure 5: Object-wise error rate on the manipulated image-question pairs. We plot the error rate of each object to investigate which objects are inaccurately inferred when they are removed. We exclude objects manipulated fewer than 20 times. The solid line represents the average error rate of VLMs (7B), while the purple area indicates the 95% confidence interval.
  • ...and 2 more figures
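
The object relation table described in the Figure 4 caption can be sketched as a small grid: one row per removed object on the vision axis (plus "none" for the original image) and one column per queried object on the text axis, with each cell recording whether the answer was correct. This is only an illustration under stated assumptions: the function name `relation_table`, the answer dictionary layout, and the restriction to objects that appear in the original scene are hypothetical and not taken from the paper.

```python
def relation_table(scene_objects, answers):
    """Build a vision-axis by text-axis grid of answer correctness.

    scene_objects: objects present in the original image.
    answers: maps (removed_object_or_None, queried_object) to "yes" or "no";
             None as the removed object means the original image.
    Returns {vision_label: {queried_object: True if the answer is correct}}.
    """
    table = {}
    for removed in [None, *scene_objects]:
        row = {}
        for queried in scene_objects:
            # Assumption: an object is visible unless it is the removed one.
            expected = "no" if queried == removed else "yes"
            row[queried] = answers[(removed, queried)] == expected
        table["none" if removed is None else removed] = row
    return table


# Hypothetical answers from a model for a scene with a person and a bench.
example_answers = {
    (None, "person"): "yes",
    (None, "bench"): "yes",
    ("person", "person"): "yes",  # wrong: the person was removed
    ("person", "bench"): "yes",
    ("bench", "person"): "no",    # wrong: removing the bench changed the answer
    ("bench", "bench"): "no",
}
example_table = relation_table(["person", "bench"], example_answers)
```

In this toy grid, the wrong cell in the "bench" row and "person" column is the kind of inter-object dependency the two-axis view is meant to surface: the answer about one object changes when a different object is removed.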