Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering

Youngsun Lim; Hojun Choi; Hyunjung Shim

TL;DR

This work tackles image hallucination in text-to-image generation by introducing I-HallA, an automated evaluation metric that uses visual question answering to assess factual content in generated images. It also presents I-HallA v1.0, a 1.2K-image benchmark built from textbook content, employing a three-stage pipeline that leverages GPT-4o for reasoning and QA generation, with human validation throughout. Experiments across five state-of-the-art TTI models show persistent hallucination, while I-HallA scores strongly correlate with human judgments (e.g., Spearman ρ = 0.95), validating the metric's reliability. The framework emphasizes external knowledge and visual semantics beyond prompts and aims to guide the development of factually accurate TTI systems, with resources available on the project page.

Abstract

Despite the impressive success of text-to-image (TTI) generation models, existing studies overlook the issue of whether these models accurately convey factual information. In this paper, we focus on the problem of image hallucination, where images created by generation models fail to faithfully depict factual content. To address this, we introduce I-HallA (Image Hallucination evaluation with Question Answering), a novel automated evaluation metric that measures the factuality of generated images through visual question answering (VQA). We also introduce I-HallA v1.0, a curated benchmark dataset for this purpose. As part of this process, we develop a pipeline that generates high-quality question-answer pairs using multiple GPT-4 Omni-based agents, with human judgments to ensure accuracy. Our evaluation protocols measure image hallucination by testing if images from existing TTI models can correctly respond to these questions. The I-HallA v1.0 dataset comprises 1.2K diverse image-text pairs across nine categories with 1,000 rigorously curated questions covering various compositional challenges. We evaluate five TTI models using I-HallA and reveal that these state-of-the-art models often fail to accurately convey factual information. Moreover, we validate the reliability of our metric by demonstrating a strong Spearman correlation ($\rho = 0.95$) with human judgments. We believe our benchmark dataset and metric can serve as a foundation for developing factually accurate TTI generation models. Additional resources can be found on our project page: https://sgt-lim.github.io/I-HallA/.
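The Spearman correlation used above to compare metric scores with human judgments can be illustrated with a minimal sketch. It assumes two equal-length lists of per-item scores, ranks each list (ties receive their average rank), and takes the Pearson correlation of the ranks. The function names and the sample values are hypothetical and are not taken from the paper or its code.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Hypothetical scores for four items (illustrative values only).
metric_scores = [0.2, 0.6, 0.4, 1.0]
human_scores = [0.3, 0.7, 0.5, 0.9]
rho = spearman_rho(metric_scores, human_scores)
```

Because only ranks matter, the two lists above yield a perfect correlation even though the raw values differ.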

Paper Structure (31 sections, 1 equation, 12 figures, 6 tables)


Introduction
Related Works
Hallucination in Language Generation
Common Sense Reasoning in VLMs
Evaluating Text-to-Image Generation with Question Answering
Methodology
Image Hallucination
I-HallA v1.0: Benchmark for Evaluating Image Hallucination
Collecting Our Dataset
Enhancing Our Dataset
I-HallA: An Evaluation Metric Using Question-Answering
Experiments
Benchmark Analysis
Statistics and diversity
GPT-4o's ability on image hallucination
...and 16 more sections

Figures (12)

Figure 1: Examples of image hallucination and how I-HallA operates to evaluate it, along with a comparison to an existing metric, TIFA. I-HallA evaluates image hallucination by identifying factual information from two aspects: external knowledge and visual semantics. In contrast, TIFA can hardly evaluate image hallucination because it relies solely on text prompts. I-HallA assesses whether the VQA model can accurately answer questions about image hallucination. We use DALL-E 3 for the hallucinated images in this figure.
Figure 2: Overall pipeline of how I-HallA v1.0 is used for evaluating image hallucination: (a) Collect datasets containing prompts, factual images, and hallucinated images based on textbooks. (b) Enhance the collected dataset by leveraging the vast pre-trained knowledge and visual understanding capability of GPT-4o, adding reasoning about image hallucination to the datasets. (c) Input the prompt and reasoning into a language model to generate QA sets for evaluation. (d) Input the 5 QA sets per image and the target image into a vision-language model, and calculate the I-HallA score based on the number of correct answers. We employ GPT-4o for both the VLM and LLM.
Figure 3: Overview of I-HallA v1.0: The upper section presents the prompt, domain, category, reasoning, and I-HallA results for five QA sets. The lower section compares a factual image with hallucinated outputs from five TTI models, indicating difficulty levels and I-HallA scores. I-HallA scores shown in the bottom-right box of each image remain unchanged across the three trials.
Figure 4: I-HallA scores from five different TTI models across different categories and compositions. The factual information of images generated by TTI models using the prompts from I-HallA v1.0 is evaluated using the I-HallA metric. The I-HallA scores in this figure represent the average scores of each TTI model, calculated across different categories and compositions.
Figure 5: Plot of I-HallA scores from GPT-4o and human evaluations. Blue circles indicate average scores from 53 participants per question, with score distributions depicted via violin plots. Stars emphasize GPT-4o's results, which closely align with human judgments, demonstrating a strong correlation between the model and human evaluators.
...and 7 more figures
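The scoring step in Figure 2(d) can be sketched as follows. The caption states only that the score is based on the number of correct answers over the QA sets for an image, so this sketch assumes the per-image score is the fraction of questions answered correctly and that a model's score is the mean over its images. Both the formula and the function names are assumptions for illustration, not the paper's definition, and the VLM call that would produce the answers is left out.

```python
def ihalla_score(predicted_answers, expected_answers):
    """Fraction of questions whose predicted answer matches the expected one.

    Assumption: matching is a case-insensitive exact string comparison.
    """
    if len(predicted_answers) != len(expected_answers) or not expected_answers:
        raise ValueError("need equal-length, non-empty answer lists")
    correct = sum(
        p.strip().lower() == e.strip().lower()
        for p, e in zip(predicted_answers, expected_answers)
    )
    return correct / len(expected_answers)


def mean_score(per_image_scores):
    """Average the per-image scores to get one score per TTI model."""
    if not per_image_scores:
        raise ValueError("need at least one per-image score")
    return sum(per_image_scores) / len(per_image_scores)


# Hypothetical answers for one image with five questions.
predicted = ["Yes", "no", "yes", "no", "yes"]
expected = ["yes", "no", "no", "no", "yes"]
image_score = ihalla_score(predicted, expected)
```

In this example four of the five answers match, so the image scores 0.8 under the assumed formula.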
