Hallucination as an Upper Bound: A New Perspective on Text-to-Image Evaluation

Seyed Amir Kasaei, Mohammad Hossein Rohban

TL;DR

The paper addresses the lack of a formal framing for hallucination in text-to-image generation by defining hallucination as bias-driven deviations beyond the prompt and proposing a three-category taxonomy: object, attribute, and relation. It provides formal definitions using $P$, $O$, $O'$, $a'$, and $r$, including the upper-bound condition $O' \cap O = \emptyset$ for object hallucination. The main contributions are a principled taxonomy, an evaluation perspective that yields an upper bound, and a call for benchmarks to expose hidden priors in T2I models. This framework enhances understanding of controllability, neutrality, and trust in real-world T2I deployment by highlighting biases that extend beyond user input.
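
The disjointness condition $O' \cap O = \emptyset$ can be illustrated with a small set computation. The following is a minimal Python sketch under stated assumptions, not the authors' method: it assumes $O$ is the set of objects named in the prompt and $O'$ is the set of objects that appear in the image without being mentioned. It also assumes the object names have already been extracted by some prompt parser and object detector, neither of which is shown here. The function name `extra_objects` is hypothetical.

```python
def extra_objects(prompt_objects, detected_objects):
    """Return detected objects that the prompt never mentioned.

    Assumption: both inputs are iterables of plain object names that
    some upstream parser/detector (not shown) has already produced.
    """
    # O: objects named in the prompt (case-normalized for comparison).
    O = {name.lower() for name in prompt_objects}
    detected = {name.lower() for name in detected_objects}

    # O': detected objects outside the prompt; disjoint from O by construction.
    O_prime = detected - O
    assert O_prime.isdisjoint(O)
    return O_prime


# Illustrative, made-up inputs (not data from the paper).
prompt_objects = ["dog", "ball"]
detected_objects = ["dog", "ball", "tree", "bench"]
print(sorted(extra_objects(prompt_objects, detected_objects)))
```

The set returned here only lists candidates. Deciding whether an extra object is a bias-driven deviation, as the definition requires, takes further judgment that this sketch does not attempt.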

Abstract

In language and vision-language models, hallucination is broadly understood as content generated from a model's prior knowledge or biases rather than from the given input. While this phenomenon has been studied in those domains, it has not been clearly framed for text-to-image (T2I) generative models. Existing evaluations mainly focus on alignment, checking whether prompt-specified elements appear, but overlook what the model generates beyond the prompt. We argue for defining hallucination in T2I as bias-driven deviations and propose a taxonomy with three categories: attribute, relation, and object hallucinations. This framing introduces an upper bound for evaluation and surfaces hidden biases, providing a foundation for richer assessment of T2I models.

Paper Structure

This paper contains 6 sections.