Bringing Textual Prompt to AI-Generated Image Quality Assessment

Bowen Qu, Haohui Li, Wei Gao

TL;DR

This work tackles AGI quality assessment by explicitly modeling the image and its textual prompt, addressing a core gap in unimodal IQA approaches. It introduces IP-IQA, a CLIP-based dual-stream framework augmented with Image2Prompt incremental pretraining, an image-prompt fusion module, and a specialized [QA] token to capture image-text alignment and perceptual quality. Empirical results on AGIQA-1k and AGIQA-3k demonstrate state-of-the-art performance and highlight the additive value of prompt integration and cross-modal fusion. The approach advances practical AGIQA by reflecting the multimodal nature of AI-generated content, with potential impact on evaluation pipelines and prompt-aware generation quality control.

Abstract

AI-generated images (AGIs) are inherently multimodal. Unlike traditional image quality assessment (IQA) on natural scenes, AGI quality assessment (AGIQA) takes into account the correspondence between an image and its textual prompt. This correspondence is coupled into the ground-truth score, which confuses unimodal IQA methods. To solve this problem, we introduce IP-IQA (AGIs Quality Assessment via Image and Prompt), a multimodal framework for AGIQA that incorporates both the image and its corresponding prompt. Specifically, we propose a novel incremental pretraining task named Image2Prompt for better understanding of AGIs and their corresponding textual prompts. An effective and efficient image-prompt fusion module is also applied, along with a novel special [QA] token. Both are plug-and-play and help the image and its corresponding prompt work together. Experiments demonstrate that our IP-IQA achieves state-of-the-art performance on the AGIQA-1k and AGIQA-3k datasets. Code will be available at https://github.com/Coobiw/IP-IQA.

Paper Structure

This paper contains 16 sections, 5 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Quality assessment results generated by ResNet50 on the AGIQA-1k dataset. ResNet50 tends to assess image quality without analyzing the correspondence between the image and its text prompt, which produces unsatisfactory assessment scores.
  • Figure 2: Detailed overview of the IP-IQA framework. (a) presents the Image2Prompt incremental pretraining framework. (b) illustrates the IP-IQA framework, featuring a modular image-prompt fusion component with the trainable [QA] special token designed for quality assessment. (c) shows the workflows of the Attention Pooling module and the Cross-Modality Attention Pooling module, highlighting how the global query token differs between them. The global visual token is computed by a spatial global average pooling (GAP) operation.
  • Figure 3: Visualization of the attention maps within the Cross-Modality Attention Pooling module. From left to right: (a) input image; (b) attention highlights for the object-specific word "aircraft"; (c) attention highlights for the scene-specific word "city".
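
The difference between the two pooling modules described in Figure 2(c) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it uses a single attention head with no learned projections, the tensor sizes are hypothetical, and it assumes that the cross-modality variant takes its query from the prompt side (here a random stand-in vector named `text_query`), whereas the caption only states that the global query token varies. The only detail taken from the caption is that the global visual token comes from spatial GAP.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(query, tokens):
    """Pool `tokens` (n, d) into one vector (d,) using a single query (d,).

    Simplified single-head scaled dot-product attention without learned
    projections. Returns the pooled vector and the attention weights (n,).
    """
    d = tokens.shape[-1]
    scores = tokens @ query / np.sqrt(d)   # similarity of each token to the query
    weights = softmax(scores)              # normalize over the n tokens
    return weights @ tokens, weights

rng = np.random.default_rng(0)
# Hypothetical sizes: a 7x7 grid of 64-dimensional visual tokens.
visual_tokens = rng.normal(size=(49, 64))
# Assumption: a prompt-side embedding acts as the cross-modality query.
text_query = rng.normal(size=64)

# Attention pooling: the query is the global visual token from spatial GAP.
gap_query = visual_tokens.mean(axis=0)
pooled_img, w_img = attention_pool(gap_query, visual_tokens)

# Cross-modality attention pooling (assumed form): only the query changes.
pooled_xmod, w_xmod = attention_pool(text_query, visual_tokens)

# Reshaping the weights to the spatial grid gives an attention map of the
# kind visualized in Figure 3.
attn_map = w_xmod.reshape(7, 7)
```

Under these assumptions the two modules share the same keys and values (the visual tokens) and differ only in the query, so swapping the query is enough to turn image-only pooling into prompt-conditioned pooling.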