Addressing Image Hallucination in Text-to-Image Generation through Factual Image Retrieval

Youngsun Lim, Hyunjung Shim

TL;DR

The paper tackles image hallucination in text-to-image (T2I) diffusion models by grounding outputs in factual images retrieved from external sources. It introduces a tuning-free, retrieval-augmented workflow that edits with either InstructPix2Pix or IP-Adapter, guided by LLM-derived instructions or prompts, and supports interactive selection of the retrieved image so that the result reflects user intent. The approach covers three hallucination types (factual inconsistency, outdated knowledge, and factual fabrication) through two editing pipelines and shows qualitative improvements without retraining. The method offers a practical route to factually accurate T2I images, with potential use in education, journalism, and other domains that require reliable visual grounding.

Abstract

Text-to-image generation has shown remarkable progress with the emergence of diffusion models. However, these models often generate factually inconsistent images, failing to accurately reflect the factual information and common sense conveyed by the input text prompts. We refer to this issue as image hallucination. Drawing from studies on hallucinations in language models, we classify this problem into three types and propose a methodology that uses factual images retrieved from external sources to generate realistic images. Depending on the nature of the hallucination, we employ off-the-shelf image editing tools, either InstructPix2Pix or IP-Adapter, to leverage factual information from the retrieved image. This approach enables the generation of images that accurately reflect the facts and common sense.

Paper Structure (9 sections, 5 figures)

Figures (5)

  • Figure 1: Examples of image hallucination and the facts that should have been reflected.
  • Figure 2: The overall pipeline, showing two different strategies for preventing image hallucination, chosen according to the target of the hallucination.
  • Figure 3: Examples of image hallucination due to factual inconsistency caused by co-occurrence bias, and the images obtained after applying our methodology. An instruction is generated from the input prompt and the retrieved factual image and then used for editing.
  • Figure 4: Examples of outdated-knowledge hallucination caused by failure to reflect time-shift information, and the images obtained after applying our methodology. A factual prompt is generated from the input prompt and the retrieved factual image and then used for generation.
  • Figure 5: Examples of image hallucination due to factual fabrication, and the images obtained after applying our methodology. The same methodology as in Figure 3 is applied.
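
The routing implied by the figure captions can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation. The guidance types follow the captions: an instruction for factual inconsistency and factual fabrication, and a factual prompt for outdated knowledge. The pairing of each guidance type with a specific tool (instruction with InstructPix2Pix, factual prompt with IP-Adapter) is an assumption, since the summary only states that one of the two tools is chosen depending on the hallucination type. The names `HallucinationType`, `EditPlan`, and `plan_edit` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class HallucinationType(Enum):
    """The three hallucination types named in the TL;DR."""
    FACTUAL_INCONSISTENCY = "factual inconsistency"
    OUTDATED_KNOWLEDGE = "outdated knowledge"
    FACTUAL_FABRICATION = "factual fabrication"


@dataclass(frozen=True)
class EditPlan:
    tool: str      # off-the-shelf editor named in the abstract
    guidance: str  # kind of LLM-derived text that steers the editor


def plan_edit(kind: HallucinationType) -> EditPlan:
    """Pick an editing route for a hallucination type.

    Guidance types follow the figure captions; the tool pairing is an
    assumption made for this sketch.
    """
    if kind is HallucinationType.OUTDATED_KNOWLEDGE:
        # Figure 4: a factual prompt is generated and used.
        return EditPlan(tool="IP-Adapter", guidance="factual prompt")
    # Figures 3 and 5: an instruction is generated and used.
    return EditPlan(tool="InstructPix2Pix", guidance="instruction")


if __name__ == "__main__":
    for kind in HallucinationType:
        plan = plan_edit(kind)
        print(f"{kind.value}: {plan.tool} guided by a {plan.guidance}")
```

In a full workflow, the retrieved factual image (optionally chosen by the user) and the LLM-derived text would be passed to the selected tool; that step is left out here because the summary gives no details on it.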