Don't Fight Hallucinations, Use Them: Estimating Image Realism using NLI over Atomic Facts

Elisei Rykov, Kseniia Petrushina, Kseniia Titova, Alexander Panchenko, Vasily Konovalov

TL;DR

This work estimates image realism by treating the hallucinations of Large Vision-Language Models (LVLMs) as signals of common-sense violations. It introduces RealityCheck, a three-step pipeline that (1) generates atomic facts from an image using an LVLM, (2) computes pairwise Natural Language Inference (NLI) scores between the facts with cross-encoders, and (3) aggregates these scores into a single realism metric. The approach achieves state-of-the-art zero-shot performance on the WHOOPS! benchmark, outperforming open-source baselines and approaching fine-tuned systems, and shows that hallucination signals can be repurposed for visual realism assessment. This opens a practical path to open-source, zero-shot evaluation of image realism and offers insight into how inconsistencies among atomic facts relate to imagery perceived as nonsensical.

Abstract

Quantifying the realism of images remains a challenging problem in the field of artificial intelligence. For example, an image of Albert Einstein holding a smartphone violates common sense because modern smartphones were invented after Einstein's death. We introduce a novel method for assessing image realism using Large Vision-Language Models (LVLMs) and Natural Language Inference (NLI). Our approach is based on the premise that LVLMs may generate hallucinations when confronted with images that defy common sense. Using an LVLM to extract atomic facts from these images, we obtain a mix of accurate facts and erroneous hallucinations. We then calculate pairwise entailment scores among these facts and aggregate them into a single reality score. This process identifies contradictions between genuine facts and hallucinated ones, signaling that an image violates common sense. Our approach achieves a new state of the art in zero-shot mode on the WHOOPS! dataset.
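
The scoring and aggregation steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the weights, the `min` aggregation, the function names, and the keyword-based `toy_nli` scorer are all assumptions made for illustration. In practice `toy_nli` would be replaced by a real NLI cross-encoder that returns class probabilities for a (premise, hypothesis) pair, and the facts would come from an LVLM.

```python
from itertools import permutations
from typing import Callable, Dict, Sequence

# Hypothetical interface: an NLI scorer maps (premise, hypothesis)
# to probabilities for the three NLI classes.
NliScorer = Callable[[str, str], Dict[str, float]]

# Illustrative weights only; these values are NOT taken from the paper.
WEIGHTS = {"entailment": 1.0, "neutral": 0.0, "contradiction": -1.0}


def pair_score(probs: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Collapse the three class probabilities into one weighted score."""
    return sum(weights[label] * probs[label] for label in weights)


def reality_score(
    facts: Sequence[str],
    nli: NliScorer,
    aggregate: Callable[[Sequence[float]], float] = min,
) -> float:
    """Score every ordered pair of facts and aggregate into one number.

    `min` is an assumed aggregation choice: a single strong contradiction
    is enough to pull the score down. Other aggregations can be passed in.
    """
    scores = [pair_score(nli(p, h)) for p, h in permutations(facts, 2)]
    if not scores:
        raise ValueError("need at least two facts to compare")
    return aggregate(scores)


def toy_nli(premise: str, hypothesis: str) -> Dict[str, float]:
    """Keyword-based stand-in for a real cross-encoder (assumption)."""
    p, h = premise.lower(), hypothesis.lower()
    clash = ("smartphone" in p and "1950s" in h) or ("1950s" in p and "smartphone" in h)
    if clash:
        return {"entailment": 0.05, "neutral": 0.05, "contradiction": 0.9}
    return {"entailment": 0.3, "neutral": 0.65, "contradiction": 0.05}


# Hand-written facts standing in for LVLM output.
normal_facts = [
    "A man holds a book.",
    "The man wears a suit.",
    "The photo is black and white.",
]
weird_facts = [
    "A man holds a smartphone.",
    "The photo was taken in the 1950s.",
    "The man wears a suit.",
]

normal = reality_score(normal_facts, toy_nli)
weird = reality_score(weird_facts, toy_nli)
print(f"normal: {normal:.2f}, weird: {weird:.2f}")
```

With this toy scorer the mutually contradictory fact set receives the lower score. A decision threshold on the aggregated score would then separate normal from weird images; how that threshold and the weights are chosen is left open here.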

Paper Structure

This paper contains 18 sections, 6 equations, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: A pair of images from the WHOOPS! dataset with the corresponding generated atomic facts. The normal image is on the left and the weird image is on the right. All the facts associated with the normal image are consistent and accurately describe it. For the weird image, however, the LVLM hallucinates and generates untruthful facts.
  • Figure 2: Weird image detection pipeline. First, we generate five atomic facts that describe the image with an LVLM (llava-v1.6-mistral-7b-hf). Next, we build a matrix of pairwise NLI scores, where each NLI score is a weighted combination of the entailment, neutral, and contradiction scores. Finally, we aggregate the NLI scores and use the aggregated score to decide whether the image is weird.