A Survey of Multimodal Hallucination Evaluation and Detection

Zhiyuan Chen, Yuecong Min, Jie Zhang, Bei Yan, Jiahao Wang, Xiaozhen Wang, Shiguang Shan

TL;DR

This survey addresses the hallucination problem in multimodal large language models by introducing a dual taxonomy of faithfulness and factuality, applicable to both I2T and T2I tasks. It compiles and categorizes existing benchmarks (discriminative, generative, and comprehensive) and outlines metrics, data sources, and construction methods, while highlighting trends toward automated generation and fine-grained evaluation. The paper also surveys a wide range of detection methods (black-box, white-box, and unified) and discusses how I2T and T2I detection strategies can inform each other, noting current gaps such as limited domain-specific factuality evaluation and the need for unified, scalable frameworks. Finally, it identifies key challenges and outlines future directions, including explainable evaluation, domain-aware benchmarks, and real-world scenario testing to improve reliability and safety of multimodal systems.

Abstract

Multi-modal Large Language Models (MLLMs) have emerged as a powerful paradigm for integrating visual and textual information, supporting a wide range of multi-modal tasks. However, these models often suffer from hallucination, producing content that appears plausible but contradicts the input content or established world knowledge. This survey offers an in-depth review of hallucination evaluation benchmarks and detection methods across Image-to-Text (I2T) and Text-to-Image (T2I) generation tasks. Specifically, we first propose a taxonomy of hallucination based on faithfulness and factuality, incorporating the common types of hallucinations observed in practice. Then we provide an overview of existing hallucination evaluation benchmarks for both T2I and I2T tasks, highlighting their construction process, evaluation objectives, and employed metrics. Furthermore, we summarize recent advances in hallucination detection methods, which aim to identify hallucinated content at the instance level and serve as a practical complement to benchmark-based evaluation. Finally, we highlight key limitations in current benchmarks and detection methods, and outline potential directions for future research.

Paper Structure

This paper contains 32 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Trends in the number of relevant papers on I2T and T2I hallucination evaluation from Google Scholar, highlighting the rapid growth in recent years. Dashed lines indicate approximate predictions.
  • Figure 2: Overview of the main structure and taxonomy presented in this survey.
  • Figure 3: Examples of object-level (a), attribute-level (b), and scene-level (c) hallucinations (from left to right) in image-to-text (top) and text-to-image (bottom) tasks. Hallucinated responses are highlighted in red. All the images are generated by Stable Diffusion 1.5 (Rombach et al., 2022).
  • Figure 4: Examples of commonsense-based (a), physical-specific (b), and medical-specific (c) hallucinations (from left to right) in image-to-text (top) and text-to-image (bottom) tasks. Hallucinated responses are highlighted in red. All the images are generated by Stable Diffusion 1.5 (Rombach et al., 2022).
  • Figure 5: Examples of detection methods. At the top is a mismatched image-text pair, and hallucinated text contents are highlighted in red. At the bottom are representative (a) Detector-based, (b) Caption-based, and (c) VQA-based hallucination detection methods arranged from left to right. The reference labels are highlighted in blue.