Bringing Textual Prompt to AI-Generated Image Quality Assessment
Bowen Qu, Haohui Li, Wei Gao
TL;DR
This work tackles AI-generated image (AGI) quality assessment by jointly modeling the image and its textual prompt, addressing a core gap in unimodal IQA approaches, which ignore the prompt. It introduces IP-IQA, a CLIP-based dual-stream framework augmented with Image2Prompt incremental pretraining, an image-prompt fusion module, and a special [QA] token, so that the model captures both image-text alignment and perceptual quality. Empirical results on AGIQA-1k and AGIQA-3k demonstrate state-of-the-art performance and highlight the added value of prompt integration and cross-modal fusion. The approach advances practical AGIQA by reflecting the multimodal nature of AI-generated content, with potential impact on evaluation pipelines and prompt-aware quality control for generation.
Abstract
AI-Generated Images (AGIs) are inherently multimodal. Unlike traditional image quality assessment (IQA) for natural scenes, AGI quality assessment (AGIQA) takes the correspondence between an image and its textual prompt into consideration. This correspondence is entangled in the ground-truth score, which confounds unimodal IQA methods. To solve this problem, we introduce IP-IQA (AGIs Quality Assessment via Image and Prompt), a multimodal framework for AGIQA that incorporates both the image and its corresponding prompt. Specifically, we propose a novel incremental pretraining task named Image2Prompt to better understand AGIs and their corresponding textual prompts. We also apply an effective and efficient image-prompt fusion module, along with a novel special [QA] token. Both are plug-and-play and help the image and its corresponding prompt work together. Experiments demonstrate that IP-IQA achieves state-of-the-art performance on the AGIQA-1k and AGIQA-3k datasets. Code will be available at https://github.com/Coobiw/IP-IQA.
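The abstract does not specify the internals of the fusion module or how the [QA] token is used. As a rough illustration of the general idea only, the sketch below assumes one plausible arrangement: a [QA] vector acts as a single attention query over the concatenated image and prompt token features, and a linear head maps the fused vector to a scalar score. All names (`fuse`, `predict_quality`, `head_w`, `head_b`) are hypothetical, the features are random stand-ins for encoder outputs, and nothing here is trained or taken from the IP-IQA code.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fuse(qa_token, image_tokens, prompt_tokens):
    """Assumed fusion step: the [QA] vector attends over image AND prompt tokens.

    Single-head scaled dot-product attention with no learned projections,
    so the result is a convex combination of the input token vectors.
    """
    tokens = image_tokens + prompt_tokens
    d = len(qa_token)
    weights = softmax([dot(qa_token, t) / math.sqrt(d) for t in tokens])
    return [sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(d)]


def predict_quality(qa_token, image_tokens, prompt_tokens, head_w, head_b):
    """Map the fused [QA] representation to a scalar with a linear head (hypothetical)."""
    fused = fuse(qa_token, image_tokens, prompt_tokens)
    return dot(fused, head_w) + head_b


# Random stand-ins for encoder outputs; a real system would use learned features.
random.seed(0)
d = 8
image_tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(5)]
prompt_tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(3)]
qa_token = [random.gauss(0, 1) for _ in range(d)]   # would be a learnable parameter
head_w = [random.gauss(0, 1) for _ in range(d)]     # would be a learnable parameter

score = predict_quality(qa_token, image_tokens, prompt_tokens, head_w, 0.0)
print(f"predicted quality (untrained, illustrative): {score:.4f}")
```

Because the [QA] query sees both modalities, the predicted score in this sketch depends on the prompt as well as the image, which is the property a unimodal IQA model lacks.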
