QUASAR: QUality and Aesthetics Scoring with Advanced Representations

Sergey Kastryulin, Denis Prokopenko, Artem Babenko, Dmitry V. Dylov

TL;DR

This work tackles the need for generalizable image quality and aesthetics assessment without prompt engineering. It introduces QUASAR, a data-driven non-parametric framework built from Anchor Data, an Image Encoder, and an Aggregation Function to produce a unified score. Across 8 benchmarks and 7 self-supervised models, QUASAR outperforms CLIP-IQA and demonstrates robustness to data preprocessing and anchor choice. The approach yields high agreement with human judgments, even with limited data, and offers a scalable, universal tool for evaluating both technical quality and aesthetics of visual content.

Abstract

This paper introduces a new data-driven, non-parametric method for image quality and aesthetics assessment, surpassing existing approaches and requiring no prompt engineering or fine-tuning. We eliminate the need for expressive textual embeddings by proposing efficient image anchors in the data. Through extensive evaluations of 7 state-of-the-art self-supervised models, our method demonstrates superior performance and robustness across various datasets and benchmarks. Notably, it achieves high agreement with human assessments even with limited data and shows high robustness to the nature of data and their pre-processing pipeline. Our contributions offer a streamlined solution for assessment of images while providing insights into the perception of visual information.
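
The idea of replacing text prompts with image anchors can be illustrated with a minimal sketch. It assumes that embeddings have already been produced by some image encoder, that anchors are aggregated by a simple mean, and that the final score is a CLIP-IQA-style softmax over cosine similarities to a "high" and a "low" centroid. The function names, the `temperature` value, and the toy vectors are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=float)
    return x / max(np.linalg.norm(x), eps)


def mean_centroid(anchor_embeddings):
    """Mean aggregation (assumed): average the anchor embeddings into one centroid."""
    return l2_normalize(np.mean(np.asarray(anchor_embeddings, dtype=float), axis=0))


def anchor_score(image_embedding, high_centroid, low_centroid, temperature=0.01):
    """Softmax over cosine similarities to the two centroids.

    Returns a value in [0, 1]; values near 1 mean the image embedding lies
    closer to the high-rated anchors. `temperature` is a hypothetical parameter.
    """
    x = l2_normalize(image_embedding)
    sims = np.array([x @ high_centroid, x @ low_centroid])
    logits = sims / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[0])


# Toy 3-D "embeddings" standing in for encoder outputs (illustrative only).
high_anchors = [[1.0, 0.1, 0.0], [1.0, -0.1, 0.0]]
low_anchors = [[0.0, 1.0, 0.1], [0.0, 1.0, -0.1]]
c_high = mean_centroid(high_anchors)
c_low = mean_centroid(low_anchors)

score_good = anchor_score([0.9, 0.1, 0.0], c_high, c_low)
score_bad = anchor_score([0.1, 0.9, 0.0], c_high, c_low)
score_tie = anchor_score([1.0, 1.0, 0.0], c_high, c_low)
```

In this sketch no text encoder or prompt is involved: the only inputs are anchor images with known ratings and the image to be scored, which is what makes the approach non-parametric.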

Paper Structure (15 sections, 3 equations, 13 figures, 1 table)

Figures (13)

  • Figure 1: (a) The best performance of QUASAR and CLIP-IQA metrics on 6 image quality and 2 image aesthetics benchmarks. Note the large gain on widely used TID2013, KADID10k, and the most challenging TAD66k datasets. (b) The proposed metric enables a clear separation of embeddings, as shown for the AVA data.
  • Figure 2: (a) The QUASAR framework consists of Anchor Data, an Image Encoder, and an Aggregation Function. Given an input image, the centroids are used to compute the final score $\bar{s}$. (b) Three types of Aggregation Functions. Each Aggregation Function produces centroids with distinct properties.
  • Figure 3: Samples from LIVEitW dataset, divided into three categories, according to MOS: (a) top, (b) median, and (c) low.
  • Figure 4: Comparison of QUASAR and CLIP-IQA variants with standard OpenCLIP [ilharco_gabriel_2021_5143773, cherti2023reproducible], resizing to 224 px resolution and allowing for direct use of ViT backbones. RN50 and RN50* denote ResNet-50 backbones with and without positional embeddings, respectively. Note the gain in performance between CLIP-IQA and QUASAR, showcasing the robustness of the latter to small tweaks in the pre-processing pipeline. (a) Variants of QUASAR for IQA with varying anchor data (PIPAL, KADIS700k). Despite being smaller and having a different set of distortions, the PIPAL version has comparable performance, showcasing the robustness of QUASAR to the choice of anchor data. (b) QUASAR variant with Mean aggregation and resizing on IAA datasets. Note the large and consistent performance gain over the prompt-based metric.
  • Figure 5: QUASAR and CLIP-IQA performance, depending on the resolution of input data. Unlike QUASAR, CLIP-IQA shows a significant drop of SRCC when images are resized to the uniform 224 px resolution on SPAQ, KonIQ10k, and LIVEitW datasets. RN50 and RN50* denote ResNet-50 backbones with and without positional embeddings, respectively.
  • ...and 8 more figures
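
The captions above report agreement with human ratings as SRCC (Spearman's rank correlation coefficient) against MOS. As a rough illustration of what that number measures, the sketch below computes it as the Pearson correlation of ranks, with tied values sharing their average rank. The helper names and the sample numbers are made up for illustration and are not taken from the paper.

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        mask = values == v
        ranks[mask] = ranks[mask].mean()
    return ranks


def srcc(predicted_scores, mos):
    """Spearman correlation: Pearson correlation between the two rank vectors."""
    r_pred = average_ranks(predicted_scores)
    r_mos = average_ranks(mos)
    return float(np.corrcoef(r_pred, r_mos)[0, 1])


# Hypothetical metric outputs and mean opinion scores for four images.
metric_scores = [0.10, 0.40, 0.35, 0.80]
human_mos = [1.0, 3.0, 2.0, 5.0]
agreement = srcc(metric_scores, human_mos)
```

Because only the ordering matters, a metric can reach a high SRCC without matching the scale of the MOS values, which is why rank correlation is a common choice for comparing quality metrics that have different output ranges.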