Align Beyond Prompts: Evaluating World Knowledge Alignment in Text-to-Image Generation

Wenchao Zhang, Jiahe Tian, Runze He, Jizhong Han, Jiao Dai, Miaomiao Feng, Wei Mi, Xiaodan Zhang

TL;DR

ABP tackles the gap of evaluating text-to-image generation against real-world knowledge beyond user prompts by introducing a 2,060-prompt benchmark spanning six knowledge domains and the ABPScore metric that leverages Multimodal Large Language Models to verify implied knowledge in generated images. Through evaluations of eight state-of-the-art T2I models, ABP reveals that even top systems like GPT-4o struggle to consistently embed simple world knowledge, especially in chemical domains. The paper also presents Inference-Time Knowledge Injection (ITKI), a training-free strategy that augments prompts to improve ABPScore by about 43% on 200 challenging samples, demonstrating the potential of prompt-level reasoning enhancements. Overall, ABP and ABPScore offer a rigorous framework for measuring and improving world-knowledge alignment in T2I, with significant implications for reliability and safe deployment of image generation systems.

Abstract

Recent text-to-image (T2I) generation models have advanced significantly, enabling the creation of high-fidelity images from textual prompts. However, existing evaluation benchmarks primarily focus on the explicit alignment between generated images and prompts, neglecting the alignment with real-world knowledge beyond prompts. To address this gap, we introduce Align Beyond Prompts (ABP), a comprehensive benchmark designed to measure the alignment of generated images with real-world knowledge that extends beyond the explicit user prompts. ABP comprises over 2,000 meticulously crafted prompts, covering real-world knowledge across six distinct scenarios. We further introduce ABPScore, a metric that utilizes existing Multimodal Large Language Models (MLLMs) to assess the alignment between generated images and world knowledge beyond prompts, which demonstrates strong correlations with human judgments. Through a comprehensive evaluation of 8 popular T2I models using ABP, we find that even state-of-the-art models, such as GPT-4o, face limitations in integrating simple real-world knowledge into generated images. To mitigate this issue, we introduce a training-free strategy within ABP, named Inference-Time Knowledge Injection (ITKI). By applying this strategy to optimize 200 challenging samples, we achieved an improvement of approximately 43% in ABPScore. The dataset and code are available at https://github.com/smile365317/ABP.
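
The abstract describes two ideas that a small sketch can make concrete: augmenting a prompt with knowledge it leaves implicit (the idea behind ITKI), and scoring an image by checking whether that knowledge is visibly satisfied (the idea behind ABPScore). The code below is only an illustration under stated assumptions. The function names `inject_knowledge` and `knowledge_alignment_score`, the augmentation template, and the simple fraction-of-checks aggregation are all hypothetical; this summary does not give the paper's actual ITKI procedure or ABPScore formula, and the knowledge statements and verifier verdicts here are supplied by hand rather than by a language model.

```python
from typing import List


def inject_knowledge(prompt: str, knowledge: List[str]) -> str:
    """Append explicit knowledge statements to a prompt.

    Hypothetical ITKI-style augmentation: the template below is an
    assumption for illustration, not the paper's wording.
    """
    if not knowledge:
        return prompt
    # Normalize each fact so the joined sentence has a single final period.
    facts = "; ".join(k.strip().rstrip(".") for k in knowledge)
    return f"{prompt.rstrip('. ')}. Ensure the image reflects: {facts}."


def knowledge_alignment_score(checks: List[bool]) -> float:
    """Fraction of implied-knowledge checks judged satisfied.

    Assumed aggregation: each entry stands for one yes/no verdict that a
    verifier (for example an MLLM) would return for one knowledge item.
    """
    if not checks:
        return 0.0
    return sum(checks) / len(checks)


if __name__ == "__main__":
    augmented = inject_knowledge(
        "A glass of water left outside in winter.",
        ["water freezes into ice"],
    )
    print(augmented)
    # Hand-written verdicts standing in for verifier output.
    print(knowledge_alignment_score([True, False, True, True]))
```

In practice the knowledge list and the verdicts would come from model calls; keeping them as plain inputs here makes the sketch runnable without any external service.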

Paper Structure

This paper contains 13 sections, 1 equation, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Examples from ABP. Each example shows the image before ITKI optimization (failing to align with real-world knowledge) on the left and the image after ITKI optimization (aligned with world knowledge) on the right. Each image includes its ABPScore result: correctly generated images are marked with a check mark "✓", and incorrect images are marked with a cross "✗".
  • Figure 2: The construction of ABP. (Upper) Collection of Knowledge Anchors. We manually collect and filter knowledge anchors from various online repositories, including Wikipedia and ConceptNet, across six different scenes; (Middle) Constructing Prompts. We use GPT-4o to generate prompts from the collected knowledge anchors, which are then filtered and optimized to align with the criteria of Reasonability, Implicature, and Visualizability. Subsequently, images are generated using eight state-of-the-art T2I models; (Bottom) Quality Evaluation. We use ABPScore to extract world knowledge beyond the prompts and associated objects, and validate the alignment between the extracted knowledge and the generated images.
  • Figure 3: Statistics for the ABP dataset. The inner ring illustrates the six world knowledge domains covered by ABP: physical scenes, chemical scenes, animal scenes, plant scenes, human scenes, and factual scenes. As individual prompts may span multiple knowledge domains, the total number of prompts across all domains exceeds 2,060. The outer ring illustrates the five most frequent specific knowledge categories within each domain.
  • Figure 4: Human judgments. We show the average human judgments for eight T2I models, with the first four being open-source (SDXL, SD3-M, SD3.5-L, CogView4) and the remaining four being closed-source (Midjourney V6, Gemini 2.0, DALL-E 3, GPT-4o). Our analysis reveals two key insights: (1) all models demonstrate strong performance in factual scenes, but their performance is significantly weaker in chemical scenes, (2) open-source models still lag behind closed-source models.
  • Figure 5: Performance comparison before and after ITKI. Each T2I model shows significant improvement.
  • ...and 1 more figure