A Comprehensive Survey of Hallucination in Large Language, Image, Video and Audio Foundation Models

Pranab Sahoo, Prabhash Meharia, Akash Ghosh, Sriparna Saha, Vinija Jain, Aman Chadha

TL;DR

This survey presents a holistic, multimodal framework for understanding hallucination in foundation models spanning language, vision, audio, and video. It synthesizes detection and mitigation techniques, introduces a taxonomy, and reviews modality-specific benchmarks and datasets. By covering LLMs, LVLMs, video, and audio models, the paper highlights cross-cutting challenges and practical evaluation gaps, proposing directions toward data quality, automated assessment, and grounded reasoning. The work aims to guide researchers and practitioners in building more reliable, trustworthy multimodal AI systems.

Abstract

The rapid advancement of foundation models (FMs) across language, image, audio, and video domains has shown remarkable capabilities in diverse tasks. However, the proliferation of FMs brings forth a critical challenge: the potential to generate hallucinated outputs, particularly in high-stakes applications. The tendency of foundation models to produce hallucinated content arguably represents the biggest hindrance to their widespread adoption in real-world scenarios, especially in domains where reliability and accuracy are paramount. This survey paper presents a comprehensive overview of recent developments that aim to identify and mitigate the problem of hallucination in FMs, spanning text, image, video, and audio modalities. By synthesizing recent advancements in detecting and mitigating hallucination across various modalities, the paper aims to provide valuable insights for researchers, developers, and practitioners. Essentially, it establishes a clear framework encompassing definition, taxonomy, and detection strategies for addressing hallucination in multimodal foundation models, laying the foundation for future research in this pivotal area.

Paper Structure

This paper contains 20 sections, 6 figures, and 1 table.

Figures (6)

  • Figure 1: Illustration of Hallucination types. Proper explanations of hallucinations are indicated as hallucinated elements (HE) and are highlighted in bold red text.
  • Figure 2: Taxonomy of hallucination in large foundation models, organized around detection and mitigation techniques.
  • Figure 3: LLM responses showing the types of hallucinations, highlighted in red, green, and blue [zhang2023siren].
  • Figure 4: Four IVL-Hallu examples from the Prompted Hallucination Dataset (PhD) [liu2024phd], including visuals with the matching question-answer pairs and hallucination elements (HE). Words annotated in red do not exist in, or do not match, the image; words annotated in green have correspondences within the image. Question, Answer, and Statement are denoted by the letters Q, A, and S, respectively.
  • Figure 5: A video with descriptions generated by the VLTinT model and the ground truth (GT), with description errors highlighted in red italics [chuang2023clearvid].
  • ...and 1 more figure