AIGCs Confuse AI Too: Investigating and Explaining Synthetic Image-induced Hallucinations in Large Vision-Language Models
Yifei Gao, Jiaqi Wang, Zhiyu Lin, Jitao Sang
TL;DR
The study reveals that synthetic images can exacerbate hallucinations in Large Vision-Language Models (LVLMs) and identifies a synthetic-image hallucination bias characterized by higher hallucination counts and more uniform content placement. It proposes Semantics Translation to generate semantically faithful synthetic images and evaluates LVLMs on POPE and AMBER, uncovering the bias across generative and discriminative tasks. A key finding is that the choice of visual projection module (Q-former versus linear projection) can amplify or mitigate this bias, suggesting concrete architectural levers for mitigation. The work highlights important natural-synthetic out-of-distribution considerations and outlines directions for reducing synthetic-data-induced risks in real-world deployments.
Abstract
The evolution of Artificial Intelligence Generated Contents (AIGCs) is advancing towards higher quality. The growing interactions with AIGCs present a new challenge to the data-driven AI community: while AI-generated content has played a crucial role in a wide range of AI models, the potential hidden risks it introduces have not been thoroughly examined. Beyond human-oriented forgery detection, AI-generated content poses potential issues for AI models originally designed to process natural data. In this study, we underscore the exacerbated hallucination phenomena in Large Vision-Language Models (LVLMs) caused by AI-synthetic images. Remarkably, our findings shed light on a consistent AIGC hallucination bias: the object hallucinations induced by synthetic images are characterized by a greater quantity and a more uniform position distribution, even though these synthetic images do not manifest unrealistic or additional relevant visual features compared to natural images. Moreover, our investigations on the Q-former and the linear projector reveal that synthetic images may present token deviations after visual projection, thereby amplifying the hallucination bias.
