TIER: Text-Image Encoder-based Regression for AIGC Image Quality Assessment

Jiquan Yuan, Xinyan Cao, Jinming Che, Qinyuan Wang, Sen Liang, Wei Ren, Jinlong Lin, Xixin Cao

TL;DR

The paper addresses AIGC image quality assessment (AIGCIQA), where prior approaches underuse the text prompts behind generated images. It introduces TIER, a multimodal regression framework that encodes the text prompt with a text encoder and the generated image with an image encoder, then feeds the combined features to a regression head that predicts a quality score. Evaluations on AGIQA-1K, AGIQA-3K, and AIGCIQA2023 show that incorporating text prompts generally outperforms an image-only baseline, which highlights the value of prompt-aware representations. Some results, such as the correspondence scores on AIGCIQA2023, indicate room for better modeling of text-image relationships. Overall, the work advances practical AIGCIQA by using multimodal cues to make assessment more reliable and more applicable in real-world AIGC workflows.

Abstract

Recently, AIGC image quality assessment (AIGCIQA), which aims to assess the quality of AI-generated images (AIGIs) from a human perception perspective, has emerged as a new topic in computer vision. Unlike common image quality assessment tasks, where images are derived from original ones distorted by noise, blur, compression, etc., images in AIGCIQA tasks are typically generated by generative models using text prompts. Considerable efforts have been made in the past years to advance AIGCIQA. However, most existing AIGCIQA methods regress predicted scores directly from individual generated images, overlooking the information contained in the text prompts of these images. This oversight partially limits the performance of these AIGCIQA methods. To address this issue, we propose a text-image encoder-based regression (TIER) framework. Specifically, we process the generated images and their corresponding text prompts as inputs, utilizing a text encoder and an image encoder to extract features from these text prompts and generated images, respectively. To demonstrate the effectiveness of our proposed TIER method, we conduct extensive experiments on several mainstream AIGCIQA databases, including AGIQA-1K, AGIQA-3K, and AIGCIQA2023. The experimental results indicate that our proposed TIER method outperforms the baseline in most cases.

Paper Structure (21 sections, 4 equations, 3 figures, 1 table)

Figures (3)

  • Figure 1: (a) In common image quality assessment tasks, images are derived from original ones distorted by noise, blur, and compression, etc. (b) In AIGCIQA tasks, images are typically generated by generative models using text prompts.
  • Figure 2: The pipeline of our proposed text-image encoder-based regression (TIER) framework. We process the generated images and their corresponding text prompts as inputs, utilizing a text encoder and an image encoder to extract features from these text prompts and generated images, respectively. The text encoder employs a text transformer model commonly used in natural language processing (NLP), while the image encoder can be a convolutional neural network (CNN) or a vision transformer. These extracted text and image features are then concatenated and fed into a regression network to regress predicted scores.
  • Figure 3: Comparison of our proposed TIER method, using different text encoders and image encoders, with the baseline on three mainstream AIGCIQA databases.
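
The data flow described in the Figure 2 caption (encode the prompt, encode the image, concatenate the two feature vectors, regress a score) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not the authors' implementation: encode_text is a toy hashed bag-of-words stand-in for a text transformer, encode_image is a toy band-averaging stand-in for a CNN or vision transformer, RegressionHead is a single untrained linear layer, and the feature sizes TEXT_DIM and IMAGE_DIM are arbitrary.

```python
import math
import random
from typing import List, Sequence

TEXT_DIM = 8   # hypothetical text feature size
IMAGE_DIM = 4  # hypothetical image feature size


def encode_text(prompt: str, dim: int = TEXT_DIM) -> List[float]:
    """Toy stand-in for a text transformer: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for token in prompt.lower().split():
        # Deterministic bucket: sum of character codes modulo dim.
        vec[sum(ord(ch) for ch in token) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def encode_image(pixels: Sequence[Sequence[float]], dim: int = IMAGE_DIM) -> List[float]:
    """Toy stand-in for a CNN or vision transformer: mean of `dim` horizontal bands."""
    rows = len(pixels)
    feats = []
    for b in range(dim):
        band = pixels[b * rows // dim:(b + 1) * rows // dim]
        values = [p for row in band for p in row]
        feats.append(sum(values) / len(values) if values else 0.0)
    return feats


class RegressionHead:
    """Single linear layer mapping the concatenated features to one score."""

    def __init__(self, in_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.weights = [rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
        self.bias = 0.0

    def __call__(self, features: Sequence[float]) -> float:
        return sum(w * f for w, f in zip(self.weights, features)) + self.bias


def predict_quality(prompt: str, pixels: Sequence[Sequence[float]],
                    head: RegressionHead) -> float:
    text_feat = encode_text(prompt)    # features from the text prompt
    image_feat = encode_image(pixels)  # features from the generated image
    fused = text_feat + image_feat     # concatenation of the two feature lists
    return head(fused)                 # regress a single predicted score


# Usage with a synthetic 8x8 grayscale "image" and an untrained head.
head = RegressionHead(TEXT_DIM + IMAGE_DIM)
image = [[(r + c) / 14 for c in range(8)] for r in range(8)]
score = predict_quality("a cat on a sofa", image, head)
```

Because the head is randomly initialised, the returned score carries no meaning until the head is fit to quality labels. An image-only variant of this sketch would pass image_feat alone to a head of size IMAGE_DIM, which mirrors the kind of baseline the TL;DR compares against.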