Align Beyond Prompts: Evaluating World Knowledge Alignment in Text-to-Image Generation

Wenchao Zhang; Jiahe Tian; Runze He; Jizhong Han; Jiao Dai; Miaomiao Feng; Wei Mi; Xiaodan Zhang

TL;DR

ABP addresses a gap in text-to-image (T2I) evaluation: whether generated images agree with real-world knowledge that the prompt implies but does not state. It introduces a 2,060-prompt benchmark spanning six knowledge domains and ABPScore, a metric that uses Multimodal Large Language Models (MLLMs) to verify implied knowledge in generated images. Evaluations of eight state-of-the-art T2I models show that even top systems such as GPT-4o struggle to consistently depict simple world knowledge, especially in the chemical domain. The paper also presents Inference-Time Knowledge Injection (ITKI), a training-free strategy that augments prompts and improves ABPScore by about 43% on 200 challenging samples. Together, ABP and ABPScore offer a framework for measuring and improving world-knowledge alignment in T2I generation, which matters for the reliable and safe deployment of image generation systems.

Abstract

Recent text-to-image (T2I) generation models have advanced significantly, enabling the creation of high-fidelity images from textual prompts. However, existing evaluation benchmarks primarily focus on the explicit alignment between generated images and prompts, neglecting the alignment with real-world knowledge beyond prompts. To address this gap, we introduce Align Beyond Prompts (ABP), a comprehensive benchmark designed to measure the alignment of generated images with real-world knowledge that extends beyond the explicit user prompts. ABP comprises over 2,000 meticulously crafted prompts, covering real-world knowledge across six distinct scenarios. We further introduce ABPScore, a metric that utilizes existing Multimodal Large Language Models (MLLMs) to assess the alignment between generated images and world knowledge beyond prompts, and which demonstrates strong correlations with human judgments. Through a comprehensive evaluation of 8 popular T2I models using ABP, we find that even state-of-the-art models, such as GPT-4o, face limitations in integrating simple real-world knowledge into generated images. To mitigate this issue, we introduce a training-free strategy within ABP, named Inference-Time Knowledge Injection (ITKI). By applying this strategy to optimize 200 challenging samples, we achieved an improvement of approximately 43% in ABPScore. The dataset and code are available at https://github.com/smile365317/ABP.
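
The abstract describes ITKI and ABPScore only at a high level, so the following Python sketch is an illustration of the general idea rather than the authors' implementation. It assumes that prompt augmentation amounts to appending implied facts to the prompt, and that an alignment score can be approximated as the fraction of implied facts a verifier confirms in the image. The names `inject_knowledge`, `knowledge_alignment`, `infer_knowledge`, and `verify` are hypothetical; in practice the two callables would wrap a language model and an MLLM, and the actual ABPScore definition may differ.

```python
from typing import Any, Callable, Sequence


def inject_knowledge(
    prompt: str,
    infer_knowledge: Callable[[str], Sequence[str]],
) -> str:
    """Append implied world-knowledge facts to a T2I prompt.

    `infer_knowledge` is a hypothetical callable (for example, a wrapper
    around a language model) that returns facts implied by the prompt.
    """
    facts = [fact.strip() for fact in infer_knowledge(prompt) if fact.strip()]
    if not facts:
        # Nothing to inject: leave the prompt unchanged.
        return prompt
    return f"{prompt.rstrip('. ')}. Details to depict: {'; '.join(facts)}."


def knowledge_alignment(
    image: Any,
    facts: Sequence[str],
    verify: Callable[[Any, str], bool],
) -> float:
    """Fraction of implied facts that `verify` confirms in the image.

    `verify` is a hypothetical callable (for example, a yes/no query to
    an MLLM). This ratio is an assumed stand-in, not the paper's formula.
    """
    if not facts:
        raise ValueError("at least one fact is required")
    confirmed = sum(1 for fact in facts if verify(image, fact))
    return confirmed / len(facts)


if __name__ == "__main__":
    # Stub callables stand in for real models in this illustration.
    augmented = inject_knowledge(
        "A lit candle in a sealed jar",
        lambda p: ["the flame is dying out"],
    )
    print(augmented)
```

Passing the models in as callables keeps the sketch training-free and independent of any particular T2I or MLLM API, which matches the inference-time framing in the abstract.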
