DASH: Detection and Assessment of Systematic Hallucinations of VLMs

Maximilian Augustin, Yannic Neuhaus, Matthias Hein

TL;DR

DASH presents a scalable, automatic pipeline to detect and assess systematic false-positive hallucinations in vision-language models by combining text-based (DASH-LLM) and image-based (DASH-OPT) retrieval over open-world data. Through exploration, exploitation, and clustering on the ReLaION-5B corpus, it uncovers thousands of hallucination clusters across hundreds of object categories and demonstrates that these failure modes transfer to unseen models. DASH-B provides a harder benchmark to evaluate FP-hallucinations beyond saturated benchmarks like POPE, while fine-tuning with DASH data shows notable mitigation benefits. The work highlights the importance of open-world evaluation and data-driven mitigation for robust multimodal understanding in real-world applications.

Abstract

Vision-language models (VLMs) are prone to object hallucinations, where they erroneously indicate the presence of certain objects in an image. Existing benchmarks quantify hallucinations using relatively small, labeled datasets. However, this approach is i) insufficient to assess hallucinations that arise in open-world settings, where VLMs are widely used, and ii) inadequate for detecting systematic errors in VLMs. We propose DASH (Detection and Assessment of Systematic Hallucinations), an automatic, large-scale pipeline designed to identify systematic hallucinations of VLMs on real-world images in an open-world setting. A key component is DASH-OPT for image-based retrieval, where we optimize over the "natural image manifold" to generate images that mislead the VLM. The output of DASH consists of clusters of real and semantically similar images for which the VLM hallucinates an object. We apply DASH to PaliGemma and two LLaVA-NeXT models across 380 object classes and, in total, find more than 19k clusters with 950k images. We study the transfer of the identified systematic hallucinations to other VLMs and show that fine-tuning PaliGemma with the model-specific images obtained with DASH mitigates object hallucinations. Code and data are available at https://YanNeu.github.io/DASH.

Paper Structure

This paper contains 36 sections, 3 equations, 31 figures, 9 tables.

Figures (31)

  • Figure 1: DASH: Systematic Hallucinations of PaliGemma-3B.
  • Figure 2: DASH: Given an object class, e.g. dining table, we generate text-based queries with DASH-LLM or image-based queries with DASH-OPT. Optimization: we optimize the latent variables of a diffusion process to generate an image which yields "yes" for the VLM ("Can you see a dining table in this image?") and at the same time the object detector states that no "dining table" is present in the image. Exploration: the text and image queries are used for kNN-retrieval using CLIP similarity on ReLaION-5B. Exploitation: for successful images (VLM "yes", object detector "no") of the exploration phase we retrieve novel images via kNN-retrieval to check if the hallucination transfers to semantically similar images. Clustering: Finally, we cluster successful images of the exploitation step into semantically similar clusters of hallucinations of the VLM.
  • Figure 3: Examples of systematic FP-hallucination clusters found by DASH for PaliGemma: We present six hallucination clusters, each for a different object—three identified by DASH-LLM and three by DASH-OPT. For each cluster, we show a sample of images and the total number of images. For each of these images, PaliGemma answers “yes” to “Can you see a OBJ in this image?” while the object detector reports a confidence below 0.1. None of the images actually contain the object. We also provide the text (DASH-LLM) and image queries (DASH-OPT) used for retrieval during exploration for the majority of the cluster.
  • Figure 4: Histogram illustrating the minimum embedding distance from success images to the nearest LLM prompt for DASH-LLM and DASH-OPT. While both methods use these LLM prompts in their exploration stage, the image-based method is able to find unexpected hallucinations far away from the initial LLM prompts.
  • Figure 5: All clusters for DASH-LLM and DASH-OPT for the object 'Dam' using LLaVA-NeXT Vicuna. DASH-OPT identifies a larger total number of clusters and images, capturing a broader diversity of visuals. This demonstrates that DASH-OPT can uncover unexpected systematic hallucination patterns, such as cartoon frogs and dinosaurs, orange leaves, bare feet, or a park bench, whereas DASH-LLM tends to highlight failure modes more directly linked to the object, such as water associated with 'Dam'.
  • ...and 26 more figures
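
The exploration and exploitation stages described in the Figure 2 caption, together with the success criterion from the Figure 3 caption (VLM answers "yes", object detector confidence below 0.1), can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Candidate`, `is_fp_hallucination`, `dash_retrieval`, `knn`, and `evaluate` are hypothetical, and `knn` stands in for CLIP-based kNN retrieval over ReLaION-5B. The clustering step is omitted because the captions do not specify the algorithm.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Set, Tuple

# Detector confidence threshold taken from the Figure 3 caption.
DETECTOR_THRESHOLD = 0.1


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record holding both model outputs for one image."""
    image_id: str
    vlm_answer: str       # VLM reply to "Can you see a OBJ in this image?"
    detector_conf: float  # object detector confidence that OBJ is present


def is_fp_hallucination(c: Candidate, threshold: float = DETECTOR_THRESHOLD) -> bool:
    """Success: the VLM says "yes" while the detector confidence is below the threshold."""
    says_yes = c.vlm_answer.strip().lower().startswith("yes")
    return says_yes and c.detector_conf < threshold


def dash_retrieval(
    queries: Sequence[str],
    knn: Callable[[str], Sequence[str]],   # assumed: query or image id -> neighbour image ids
    evaluate: Callable[[str], Candidate],  # assumed: image id -> VLM and detector outputs
) -> Tuple[List[str], List[str]]:
    """Return (exploration successes, exploitation successes) as image ids."""
    seen: Set[str] = set()

    def successes(sources: Sequence[str]) -> List[str]:
        found: List[str] = []
        for source in sources:
            for image_id in knn(source):
                if image_id in seen:
                    continue  # only novel images are checked
                seen.add(image_id)
                if is_fp_hallucination(evaluate(image_id)):
                    found.append(image_id)
        return found

    # Exploration: retrieve neighbours of the text or image queries.
    explored = successes(queries)
    # Exploitation: retrieve novel neighbours of the successful images
    # to check whether the hallucination transfers to similar images.
    exploited = successes(explored)
    return explored, exploited
```

In this sketch, an image only counts toward a cluster candidate if it passes the same two-model check in the exploitation stage, which mirrors the caption's point that a hallucination must transfer to semantically similar images to be considered systematic.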