Evaluating Hallucination in Text-to-Image Diffusion Models with Scene-Graph based Question-Answering Agent

Ziyuan Qin, Dongjie Cheng, Haoyu Wang, Huahui Yi, Yuting Shao, Zhiyuan Fan, Kang Li, Qicheng Lao

TL;DR

The paper tackles the challenge of evaluating hallucination in text-to-image diffusion models by introducing a scene-graph based question-answering framework that leverages an image-derived knowledge graph and a large language model. It builds a 12,000-image dataset with human scores using three diffusion models and complex prompts to enable robust evaluation. The method detects content discrepancies at object, attribute, and relation levels, categorizes hallucinations, and provides a unified score via GraphQA and a rule-based module, showing stronger alignment with human judgments than existing metrics. This approach advances interpretable, automatic evaluation for T2I systems and offers practical insight into the sources of hallucination to guide model improvements.

Abstract

Contemporary Text-to-Image (T2I) models frequently depend on qualitative human evaluations to assess the consistency between synthesized images and the text prompts. There is a demand for quantitative and automatic evaluation tools, given that human evaluation lacks reproducibility. We believe that an effective T2I evaluation metric should accomplish the following: detect instances where the generated images do not align with the textual prompts, a discrepancy we define as the "hallucination problem" in T2I tasks; record the types and frequency of hallucination issues, helping users understand the causes of errors; and provide a comprehensive and intuitive score that is close to human standards. To achieve these objectives, we propose a method based on large language models (LLMs) that performs question-answering over an extracted scene graph, and we create a dataset with human-rated scores for generated images. On the methodology side, we combine knowledge-enhanced question-answering tasks with image evaluation tasks, making the evaluation metrics more controllable and easier to interpret. On the dataset side, we generated 12,000 synthesized images based on 1,000 composite prompts using three advanced T2I models. We then conducted human scoring on all synthesized image and prompt pairs to validate the accuracy and effectiveness of our method as an evaluation metric. All generated images and the human-labeled scores will be made publicly available in the future to facilitate ongoing research on this crucial issue. Extensive experiments show that our method aligns more closely with human scoring patterns than other evaluation metrics.
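
The claim that one metric "aligns more closely with human scoring" than another is usually checked with a correlation statistic. The abstract does not say which statistic the authors use, so the sketch below assumes Spearman rank correlation purely for illustration. The score lists are made-up toy values, and the names `ranks`, `pearson`, and `spearman` are hypothetical helpers, not part of the paper.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy data (invented for illustration): human ratings for five images,
# and the scores two hypothetical automatic metrics assign to them.
human = [1, 2, 3, 4, 5]
metric_a = [0.1, 0.4, 0.35, 0.8, 0.9]
metric_b = [0.9, 0.1, 0.5, 0.2, 0.3]

rho_a = spearman(human, metric_a)
rho_b = spearman(human, metric_b)
print(f"metric_a: {rho_a:.2f}, metric_b: {rho_b:.2f}")
```

Under this assumption, the metric with the higher coefficient ranks the images more like the human raters do. Here `metric_a` swaps only one adjacent pair relative to the human ordering, while `metric_b` is close to unrelated.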

Paper Structure

This paper contains 20 sections, 1 equation, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The complete framework for the text-to-image (T2I) generation to evaluation process is as follows: (a) First, we generate images from the textual prompts; (b) Then, we extract entities from these images and construct a knowledge graph; (c) Subsequently, we generate template questions based on the textual prompts and perform GraphQA using the constructed knowledge graph. Finally, we score the answers to obtain the final evaluation score.
  • Figure 2: Qualitative evaluation: for each generative model, we select several hallucinated images that are not aligned with the prompts. Each column indicates a specific type of hallucination issue. The last column shows images that are entirely off-topic with respect to the prompts.
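
The evaluation stage described in the Figure 1 caption (template questions from the prompt, question-answering against the image's graph, then scoring) can be sketched roughly as below. This is a minimal illustration under strong assumptions, not the authors' implementation. The scene graph is written by hand rather than extracted from an image. The LLM-based GraphQA step is replaced by an exact lookup. The score is the plain fraction of questions answered "yes". All names (`SceneGraph`, `make_questions`, `answer`, `evaluate`) and the question templates are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneGraph:
    """Hand-written stand-in for a graph extracted from a generated image."""
    objects: frozenset     # e.g. {"dog", "ball"}
    attributes: frozenset  # (object, attribute) pairs
    relations: frozenset   # (subject, predicate, object) triples


def make_questions(prompt_facts):
    """Build (level, question text, fact) entries from facts stated in the prompt."""
    questions = []
    for obj in prompt_facts["objects"]:
        questions.append(("object", f"Is there a {obj} in the image?", obj))
    for obj, attr in prompt_facts["attributes"]:
        questions.append(("attribute", f"Is the {obj} {attr}?", (obj, attr)))
    for subj, pred, obj in prompt_facts["relations"]:
        questions.append(
            ("relation", f"Is the {subj} {pred} the {obj}?", (subj, pred, obj))
        )
    return questions


def answer(graph, level, fact):
    """Assumption: exact lookup in the graph stands in for the LLM GraphQA agent."""
    pool = {
        "object": graph.objects,
        "attribute": graph.attributes,
        "relation": graph.relations,
    }[level]
    return fact in pool


def evaluate(prompt_facts, graph):
    """Return (fraction of questions answered 'yes', error counts per level)."""
    questions = make_questions(prompt_facts)
    errors = Counter()
    correct = 0
    for level, _text, fact in questions:
        if answer(graph, level, fact):
            correct += 1
        else:
            errors[level] += 1
    score = correct / len(questions) if questions else 0.0
    return score, dict(errors)


# Facts for the toy prompt "a brown dog next to a red ball".
prompt_facts = {
    "objects": ["dog", "ball"],
    "attributes": [("dog", "brown"), ("ball", "red")],
    "relations": [("dog", "next to", "ball")],
}
# Toy graph of an image in which the ball came out blue instead of red.
graph = SceneGraph(
    objects=frozenset({"dog", "ball"}),
    attributes=frozenset({("dog", "brown"), ("ball", "blue")}),
    relations=frozenset({("dog", "next to", "ball")}),
)

score, errors = evaluate(prompt_facts, graph)
print(score, errors)  # 0.8 {'attribute': 1}
```

Keeping the per-level error counts alongside the score mirrors the stated goal of recording the type and frequency of hallucinations (object, attribute, relation) rather than reporting a single number.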