HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning

Zhecan Wang, Garrett Bingham, Adams Yu, Quoc Le, Thang Luong, Golnaz Ghiasi

TL;DR

HaloQuest addresses the pervasive problem of vision-language model hallucination by introducing a novel VQA benchmark that uses both real and prompt-generated synthetic images to systematically elicit false premises, insufficient context, and visually challenging questions. The dataset employs a machine-human-in-the-loop pipeline and an open-ended Auto-Eval framework based on a Langfun schema to enable scalable, nuanced evaluation aligned with human judgments. Empirical results show current open-source VLMs struggle with HaloQuest in zero-shot settings, while fine-tuning on HaloQuest reduces hallucination without sacrificing standard reasoning tasks; synthetic images further enable scalable evaluation with results highly correlated to real-image performance ($r \approx 0.97$). The work also demonstrates Auto-Eval’s high agreement with human raters ($\approx 95\%$ with advanced prompting) and provides evidence that HaloQuest data improves robustness to hallucination on related benchmarks like POPE. Overall, HaloQuest advances understanding, evaluation, and mitigation of multimodal hallucination, offering a scalable path toward more reliable multimodal AI systems.

Abstract

Hallucination has been a major problem for large language models and remains a critical challenge in multimodal settings, where vision-language models (VLMs) have to deal with not just textual but also visual inputs. Despite rapid progress in VLMs, resources for evaluating and addressing multimodal hallucination are limited and mostly focused on evaluation. This work introduces HaloQuest, a novel visual question answering dataset that captures various aspects of multimodal hallucination such as false premises, insufficient contexts, and visual challenges. A novel idea from HaloQuest is to leverage synthetic images, apart from real ones, to enable dataset creation at scale. With over 7.7K examples spanning a wide variety of categories, HaloQuest was designed to be both a challenging benchmark for VLMs and a fine-tuning dataset for advancing multimodal reasoning. Our experiments reveal that current models struggle with HaloQuest, with all open-source VLMs achieving below 36% accuracy. On the other hand, fine-tuning on HaloQuest significantly reduces hallucination rates while preserving performance on standard reasoning tasks. Our results show that benchmarking with generated images is highly correlated (r=0.97) with benchmarking on real images. Last but not least, we propose a novel Auto-Eval mechanism that is highly correlated with human raters (r=0.99) for evaluating VLMs. In sum, this work makes concrete strides towards understanding, evaluating, and mitigating hallucination in VLMs, serving as an important step towards more reliable multimodal AI systems in the future.
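
The correlation figures quoted above (r=0.97 and r=0.99) are Pearson coefficients. As a minimal sketch of how such a coefficient is computed from two lists of per-model scores, the snippet below implements Pearson's r with the standard library only. The function name `pearson_r` and the score lists are illustrative assumptions; the numbers are made-up placeholders, not results from the paper, and this is not the authors' evaluation code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model accuracies (percent). These are placeholders
# chosen for illustration and are NOT taken from the paper.
real_image_acc = [10.0, 20.0, 30.0, 40.0]
synthetic_image_acc = [12.0, 19.0, 33.0, 41.0]

r = pearson_r(real_image_acc, synthetic_image_acc)
print(f"r = {r:.3f}")
```

A value close to 1 means that models ranked highly on one set of scores tend to be ranked highly on the other, which is the sense in which synthetic-image results are said to track real-image results.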

Paper Structure

This paper contains 23 sections, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Example entries from HaloQuest (bottom) and other benchmarks (top). Current benchmarks often do not incorporate synthetic images, require one-word responses, are multiple choice, or simply ask for an image description. In contrast, HaloQuest contains challenging questions in three categories, uses both real and synthetic images, and makes use of Auto-Eval to allow for free-form answer evaluation.
  • Figure 2: HaloQuest data collection pipeline. First, both real and synthetic images are collected from various sources. Next, humans and LLMs create question-answer pairs designed to elicit hallucination. Finally, a filtering mechanism removes the entries that are overly simple or ambiguous. The result is a challenging dataset that effectively exposes model hallucination tendencies.
  • Figure 3: Human evaluation vs. different evaluation metrics. Metrics are based on zero-shot evaluation (the zero-shot results table). Standard metrics like BLEU, CIDEr, ROUGE, and METEOR do not correlate well with human evaluation, demonstrating that they are insufficient for characterizing VLM hallucination. In contrast, Auto-Eval correlates strongly with human evaluation (Pearson's r), thus facilitating hallucination evaluation at scale.
  • Figure 4: Low-dimensional representation of images. Each point represents one image. CLIP embeddings were extracted for all images and then projected to a 2D space using the UMAP algorithm. HaloQuest real images occupy a similar semantic distribution to VQA v2 images, while the synthetic images are entirely novel.
  • Figure 5: Text-only prompting
  • ...and 3 more figures