Adaptive Mixed-Scale Feature Fusion Network for Blind AI-Generated Image Quality Assessment

Tianwei Zhou, Songbai Tan, Wei Zhou, Yu Luo, Yuan-Gen Wang, Guanghui Yue

TL;DR

This work tackles blind image quality assessment for AI-generated images (AGIs) across three dimensions: visual quality, authenticity, and content consistency. It introduces AMFF-Net, a multi-task architecture that uses a multi-scale input strategy and an Adaptive Feature Fusion (AFF) block to fuse multi-scale CLIP-based features, while leveraging text-image alignment to assess consistency via a cosine similarity between fused image features $F_I$ and text features $F_T$. Empirically, AMFF-Net outperforms nine state-of-the-art blind IQA methods on three AGI QA databases, with ablations showing the positive impact of the multi-scale inputs and AFF; the cosine similarity metric for text-image alignment also yields superior performance. The approach provides a practical, scalable solution for evaluating AGIs under blind conditions, enabling more reliable quality judgment across different prompts and generation models.
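For reference, the cosine similarity mentioned above has the standard form below. This is only the textbook definition applied to $F_I$ and $F_T$; how the paper maps the resulting value to the final consistency score (for example, any rescaling or regression head) is not stated in this summary and is left open here.

$$
\cos(F_I, F_T) = \frac{F_I \cdot F_T}{\lVert F_I \rVert_2 \, \lVert F_T \rVert_2}
$$

The value lies in $[-1, 1]$, with larger values indicating closer alignment between the fused image features and the text features.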

Abstract

With the increasing maturity of text-to-image and image-to-image generative models, AI-generated images (AGIs) have shown great application potential in advertising, entertainment, education, social media, etc. Although generative models have advanced remarkably, little effort has been devoted to designing quality assessment models for their outputs. In this paper, we propose a novel blind image quality assessment (IQA) network, named AMFF-Net, for AGIs. AMFF-Net evaluates AGI quality along three dimensions, i.e., "visual quality", "authenticity", and "consistency". Specifically, inspired by the characteristics of the human visual system and motivated by the observation that "visual quality" and "authenticity" are characterized by both local and global aspects, AMFF-Net scales the image up and down and takes the scaled images together with the original-sized image as inputs to obtain multi-scale features. An Adaptive Feature Fusion (AFF) block then adaptively fuses the multi-scale features with learnable weights. In addition, considering the correlation between the image and the prompt, AMFF-Net compares the semantic features from the text encoder and the image encoder to evaluate text-to-image alignment. We carry out extensive experiments on three AGI quality assessment databases, and the experimental results show that AMFF-Net outperforms nine state-of-the-art blind IQA methods. Ablation experiments further demonstrate the effectiveness of the proposed multi-scale input strategy and AFF block.
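The fusion and alignment steps described in the abstract can be illustrated with a minimal sketch. The abstract does not give the internals of the AFF block, so the code below assumes the simplest form of "learnable weights": one scalar logit per scale, normalized with a softmax. The function names (`fuse_features`, `cosine_similarity`) and the toy vectors are hypothetical, and in the actual model the per-scale features would come from the CLIP image encoder rather than hand-written lists.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(features: Sequence[Sequence[float]],
                  logits: Sequence[float]) -> List[float]:
    """Weighted sum of per-scale feature vectors.

    ASSUMPTION: one scalar weight per scale, obtained as softmax(logits).
    This stands in for the AFF block, whose real design is not given here.
    """
    if len(features) != len(logits):
        raise ValueError("need exactly one logit per scale")
    weights = softmax(logits)
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features))
            for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Standard cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy features for the down-scaled, original, and up-scaled image
# (hypothetical values, two dimensions for readability).
scale_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scale_logits = [0.0, 0.0, 0.0]  # equal logits give equal weights

fused_image_feature = fuse_features(scale_features, scale_logits)
text_feature = [1.0, 1.0]  # hypothetical text-encoder output
alignment = cosine_similarity(fused_image_feature, text_feature)
```

In a trained network the logits would be parameters updated by gradient descent, so scales that are more informative for a given quality dimension end up with larger weights.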

Paper Structure (23 sections, 10 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Comparison between NSIs (first row, selected from KADID-10k [lin2019kadid] and KonIQ-10k [lin2018koniq]) and AGIs (second row, selected from PKU-I2IQA [yuan2023pku]). NSI quality scores are rated mainly on visual experience as affected by distortions, whereas AGI quality scores are rated on visual quality (affected by distortions), authenticity (how realistic the image appears), and consistency (the alignment between image content and textual labels).
  • Figure 2: Overview of the proposed AMFF-Net. It takes the text prompt and three scaled AGIs as inputs and outputs the consistency score $S_C$, visual quality score $S_V$, and authenticity score $S_A$. The image and text encoders are taken from the pre-trained CLIP model. AFF and MLP denote the adaptive feature fusion block and multi-layer perceptron, respectively.
  • Figure 3: Architecture of the proposed AFF block.
  • Figure 4: Images from three AGI quality assessment databases: (a) AGIQA-3K [li2023agiqa], (b) AIGCIQA2023 [wang2023aigciqa2023], and (c) PKU-I2IQA [yuan2023pku].
  • Figure 5: Scatter plots of different IQA methods tested on the AGIQA-3K database. Due to space limitations, only the visual quality predictions are shown.
  • ...and 1 more figures