AIGCs Confuse AI Too: Investigating and Explaining Synthetic Image-induced Hallucinations in Large Vision-Language Models

Yifei Gao, Jiaqi Wang, Zhiyu Lin, Jitao Sang

TL;DR

The study shows that synthetic images can exacerbate hallucinations in Large Vision-Language Models (LVLMs) and identifies a synthetic-image hallucination bias: hallucinated objects are more numerous and their positions are more uniformly distributed. It proposes Semantics Translation to generate semantically faithful synthetic images and evaluates LVLMs on POPE and AMBER, uncovering the bias across generative and discriminative tasks. A key finding is that the visual projection module, particularly Q-former versus linear projection, can amplify or mitigate this bias, suggesting concrete architectural levers for mitigation. The work highlights natural-synthetic out-of-distribution considerations and outlines directions for reducing synthetic-data-induced risks in real-world deployments.

Abstract

The evolution of Artificial Intelligence Generated Contents (AIGCs) is advancing towards higher quality. The growing interactions with AIGCs present a new challenge to the data-driven AI community: while AI-generated contents have played a crucial role in a wide range of AI models, the potential hidden risks they introduce have not been thoroughly examined. Beyond human-oriented forgery detection, AI-generated content poses potential issues for AI models originally designed to process natural data. In this study, we underscore the exacerbated hallucination phenomena in Large Vision-Language Models (LVLMs) caused by AI-synthetic images. Remarkably, our findings shed light on a consistent AIGC hallucination bias: the object hallucinations induced by synthetic images are characterized by a greater quantity and a more uniform position distribution, even though these synthetic images do not manifest unrealistic or additional relevant visual features compared to natural images. Moreover, our investigations of the Q-former and linear projector reveal that synthetic images may present token deviations after visual projection, thereby amplifying the hallucination bias.
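The two properties of the hallucination bias named above (quantity and position distribution) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `hallucination_stats`, the word-matching rule, the object vocabulary, and the example responses are all hypothetical and chosen only to show the idea of counting hallucinated objects and recording where in a response they occur.

```python
def hallucination_stats(response, present_objects, vocabulary):
    """Count hallucinated object mentions and record their relative positions.

    Assumption (not from the paper): an object is "hallucinated" when a word
    from `vocabulary` appears in the response but is not in `present_objects`.
    Relative position is the word index scaled to [0, 1].
    """
    words = [w.strip(".,;:!?").lower() for w in response.split()]
    words = [w for w in words if w]
    last_index = max(len(words) - 1, 1)  # avoid division by zero
    positions = [
        i / last_index
        for i, w in enumerate(words)
        if w in vocabulary and w not in present_objects
    ]
    return len(positions), positions


# Made-up responses for illustration only; these are not results from the paper.
vocabulary = {"dog", "frisbee", "bench", "car"}
present_objects = {"dog", "frisbee"}
response_a = "A dog catches a frisbee near a bench."
response_b = "A car and a dog with a frisbee and a bench."

for name, response in [("response_a", response_a), ("response_b", response_b)]:
    count, positions = hallucination_stats(response, present_objects, vocabulary)
    print(name, count, positions)
```

Comparing the counts and the spread of the positions between a natural image's response and its synthetic counterpart's response is one simple way to make the "greater quantity" and "more uniform position distribution" observations concrete.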

Paper Structure (13 sections, 9 figures, 3 tables)

Figures (9)

  • Figure 1: A hallucination example on a natural image (left) and a synthetic image (right), where the highlighted text marks the hallucinated content. Evaluation results across various vision-language tasks, such as semantic descriptions and factual judgments, consistently show a synthetic image-induced hallucination bias.
  • Figure 2: The pipeline of the Semantics Translation method. On the left, caption generation and revision produce a correct description of the given natural image; red marks redundant or incorrect information in the initial caption. On the right, image synthesis and filtering sample the final synthetic image, ensuring strict correspondence to both the revised caption and the input natural image; highlighting marks the redundant object in the image synthesis process. The final synthetic image satisfies the criteria of authenticity and consistency.
  • Figure 3: A comparison of key object positions before and after caption revision. Taking Stable Diffusion v1.5 as an example, where the accepted character limit is 77, the key objects in the revised captions generally fall within the limit.
  • Figure 4: Hallucination statistics on different discriminative tasks for each pair of synthetic and natural images. The discriminative tasks cover reasoning about attribute, existence, and relation semantics, respectively. The attribute semantics comprise the action, number, and state information of the annotated objects.
  • Figure 5: The relative position distribution of hallucinated objects between synthetic and natural images.
  • ...and 4 more figures
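
The check behind Figure 3 (whether key objects appear within the generator's accepted caption length) can be sketched as follows. This is an illustrative assumption rather than the paper's procedure: the helper names `key_object_offsets` and `within_limit` are hypothetical, positions are measured in characters following the caption's wording, and the default limit of 77 is taken from the Figure 3 caption.

```python
def key_object_offsets(caption, key_objects):
    """Return the character offset of each key object's first mention (-1 if absent)."""
    lowered = caption.lower()
    return {obj: lowered.find(obj.lower()) for obj in key_objects}


def within_limit(caption, key_objects, limit=77):
    """True if every key object is mentioned and ends at or before `limit`."""
    offsets = key_object_offsets(caption, key_objects)
    return all(
        offset >= 0 and offset + len(obj) <= limit
        for obj, offset in offsets.items()
    )


# Made-up captions for illustration only.
verbose_caption = "a " * 40 + "dog"       # the key object is pushed past the limit
revised_caption = "a dog chasing a frisbee"

print(within_limit(verbose_caption, ["dog"]))
print(within_limit(revised_caption, ["dog", "frisbee"]))
```

A revision step that moves key objects toward the start of the caption would turn the first case into the second, which is the effect the figure compares.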