ATTIQA: Generalizable Image Quality Feature Extractor using Attribute-aware Pretraining

Daekyu Kwon, Dongyoung Kim, Sehwan Ki, Younghyun Jo, Hyong-Euk Lee, Seon Joo Kim

TL;DR

A novel pretraining framework is proposed that constructs a generalizable representation for IQA by selectively extracting quality-related knowledge from a VLM and leveraging the scalability of large datasets. It achieves state-of-the-art performance on multiple IQA datasets and exhibits remarkable generalization capabilities.

Abstract

In no-reference image quality assessment (NR-IQA), limited dataset sizes hamper the development of robust and generalizable models. Conventional methods address this issue by utilizing large datasets to extract rich representations for IQA. Other approaches propose IQA based on vision-language models (VLMs), but the domain gap between generic VLMs and IQA constrains their scalability. In this work, we propose a novel pretraining framework that constructs a generalizable representation for IQA by selectively extracting quality-related knowledge from a VLM and leveraging the scalability of large datasets. Specifically, we select optimal text prompts for five representative image quality attributes and use the VLM to generate pseudo-labels. Numerous attribute-aware pseudo-labels can be generated with large image datasets, allowing our IQA model to learn rich representations of image quality. Our approach achieves state-of-the-art performance on multiple IQA datasets and exhibits remarkable generalization capabilities. Leveraging these strengths, we propose several applications, such as evaluating image generation models and training image enhancement models, demonstrating our model's real-world applicability.

Paper Structure (18 sections, 4 equations, 4 figures, 9 tables)

Figures (4)

  • Figure 1: An illustration of IQA training strategies in previous works and ours. (a) Classic IQA models use ImageNet-pretrained models or propose image quality-related pretraining. (b) CLIP-based IQA either uses CLIP directly or adapts it for IQA using additional quality annotations, which require human labor. (c) Our method incorporates the rich representation of large datasets and leverages CLIP's IQA capability. We pretrain the IQA model with attribute-aware pseudo-labels derived from CLIP and finetune it on the target IQA dataset. (d) Cross-dataset validation results, obtained by testing on the KonIQ dataset after training on various datasets. ATTIQA achieves state-of-the-art results and exhibits superior generalization capability on unseen datasets, showing a smaller performance decline in the cross-dataset setup than other methods.
  • Figure 2: (a) The overall process of our prompt selection strategy for each image attribute (e.g., brightness). Given an attribute, we create prompt candidates using GPT-4 and then find the optimal prompt by utilizing proxy tasks related to the attribute. (b) ATTIQA's proposed pretraining pipeline. We generate attribute scores using CLIP with an antonym strategy and then train our target IQA model with a ranking-based loss on the generated scores.
  • Figure 3: Examples of generated images. The images are generated from the same prompt. ATTIQA matches human preference while the others do not.
  • Figure 4: Qualitative comparisons between our enhancement method and the retouching of Expert C. Our results are livelier and more vibrant, aligning more closely with human preference.
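
The pretraining pipeline summarized in the Figure 2(b) caption (attribute scores from CLIP with an antonym strategy, then a ranking-based loss) can be illustrated with a minimal sketch. This is not the authors' implementation: the embeddings are plain lists standing in for CLIP image and text features, and the function names `antonym_score` and `pairwise_margin_ranking_loss`, the `temperature` and `margin` values, and the specific pairwise margin form of the loss are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def antonym_score(image_emb, pos_emb, neg_emb, temperature=0.01):
    """Hypothetical pseudo-label for one attribute.

    Two-way softmax over the similarities of the image to a positive
    prompt (e.g. "bright photo") and its antonym (e.g. "dark photo").
    Returns the probability assigned to the positive prompt, in [0, 1].
    The temperature value is an assumption, not taken from the paper.
    """
    diff = (cosine(image_emb, pos_emb) - cosine(image_emb, neg_emb)) / temperature
    # A two-way softmax reduces to a sigmoid of the logit difference;
    # branch on the sign to keep exp() from overflowing.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def pairwise_margin_ranking_loss(preds, targets, margin=0.1):
    """Assumed form of a ranking-based loss over all pairs in a batch.

    For each pair whose pseudo-labels differ, penalize the model when
    its predictions are not ordered the same way by at least `margin`.
    """
    total, count = 0.0, 0
    for i in range(len(preds)):
        for j in range(i + 1, len(preds)):
            if targets[i] == targets[j]:
                continue  # no ordering information in a tie
            sign = 1.0 if targets[i] > targets[j] else -1.0
            total += max(0.0, margin - sign * (preds[i] - preds[j]))
            count += 1
    return total / count if count else 0.0


# Toy usage: 2-D vectors stand in for CLIP embeddings.
pos_prompt, neg_prompt = [1.0, 0.0], [0.0, 1.0]
images = [[1.0, 0.1], [0.1, 1.0], [1.0, 1.0]]
pseudo_labels = [antonym_score(img, pos_prompt, neg_prompt) for img in images]
model_outputs = [0.8, 0.2, 0.5]  # placeholder predictions from an IQA model
loss = pairwise_margin_ranking_loss(model_outputs, pseudo_labels)
```

A ranking objective only asks the model to reproduce the ordering of the pseudo-labels rather than their exact values, which is one plausible reason to prefer it when the labels come from a generic VLM instead of human raters.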