Image Aesthetics Assessment via Learnable Queries

Zhiwei Xiong; Yunfan Zhang; Zhiqi Shen; Peiran Ren; Han Yu

TL;DR

IAA-LQ adapts learnable queries to extract aesthetic features from pre-trained image features produced by a frozen image encoder. It outperforms the best prior method by 2.2% in SRCC and 2.1% in PLCC.

Abstract

Image aesthetics assessment (IAA) aims to estimate the aesthetic quality of images. The criteria used to judge an image depend on its content, so different images call for different criteria. Existing works rely on vision backbones pre-trained on content knowledge to learn image aesthetics. However, training those backbones is time-consuming and suffers from attention dispersion. Inspired by learnable queries in vision-language alignment, we propose the Image Aesthetics Assessment via Learnable Queries (IAA-LQ) approach. It adapts learnable queries to extract aesthetic features from pre-trained image features obtained from a frozen image encoder. Extensive experiments on real-world data demonstrate the advantages of IAA-LQ, which outperforms the best prior method by 2.2% in Spearman rank correlation coefficient (SRCC) and 2.1% in Pearson linear correlation coefficient (PLCC).
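The two reported metrics can be illustrated with a minimal sketch. It assumes predictions and ground truth are given as plain lists of scores. The function names and the sample values are hypothetical and are not taken from the paper. SRCC is computed here as the Pearson correlation of the ranks, with tied values sharing their average rank.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation coefficient (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical predicted and ground-truth scores for four images.
predicted = [5.1, 6.3, 4.2, 7.0]
ground_truth = [5.0, 6.8, 4.5, 6.9]
plcc = pearson(predicted, ground_truth)
srcc = spearman(predicted, ground_truth)
```

In this toy example both lists order the images identically, so SRCC is 1 while PLCC is slightly lower: SRCC rewards correct ranking only, whereas PLCC also penalizes deviations from a linear relationship.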

Paper Structure (11 sections, 7 equations, 2 figures, 5 tables)

Introduction
The Proposed IAA-LQ Approach
Encoding an Image
Learnable Queries & Querying Transformer
Prediction Header for IAA
Experimental Evaluation
Experiment Settings
Comparison Results
Ablation Studies
Model Interpretation
Conclusions and Future Work

Figures (2)

Figure 1: The design of IAA-LQ. A querying transformer learns embeddings for the learnable queries; pre-trained image features extracted by a frozen image encoder are fed in through cross-attention in every other transformer block. The learned query embeddings are averaged and passed through a feed-forward layer and a softmax to output the predicted aesthetic distribution of opinion scores (DOS).
Figure 2: Examples of IAA-LQ mean opinion score (MOS) predictions. The top, middle, and bottom rows show images with relatively high, moderate, and relatively low ground-truth MOSs, respectively. Beneath each image, the blue number is the predicted MOS and the green number is the ground-truth MOS.
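
The prediction step described in the Figure 1 caption (average the query embeddings, apply a feed-forward layer, then a softmax) can be sketched in plain Python as follows. This is an illustrative sketch only, not the authors' implementation. The embedding size, the ten score bins (1 to 10), the weights, and all function names are assumptions. Taking the MOS as the expected score under the predicted DOS is also an assumption about how the two quantities relate.

```python
import math

def mean_pool(query_embeddings):
    """Average a list of equal-length query embeddings element-wise."""
    n = len(query_embeddings)
    return [sum(column) / n for column in zip(*query_embeddings)]

def linear(vec, weight, bias):
    """Single feed-forward layer; weight holds one row per output score bin."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_dos(query_embeddings, weight, bias):
    """Pooled query embeddings -> feed-forward layer -> softmax over score bins."""
    return softmax(linear(mean_pool(query_embeddings), weight, bias))

def dos_to_mos(dos, scores=None):
    """Expected score under the distribution (assumed bins 1..len(dos))."""
    if scores is None:
        scores = range(1, len(dos) + 1)
    return sum(s * p for s, p in zip(scores, dos))

# Hypothetical inputs: two 3-dimensional query embeddings, ten score bins.
queries = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]
weight = [[0.1 * k, -0.05 * k, 0.02 * k] for k in range(10)]
bias = [0.0] * 10

dos = predict_dos(queries, weight, bias)
mos = dos_to_mos(dos)
```

Under these assumptions the output `dos` is a probability vector over the ten bins and `mos` lies between 1 and 10, which is what allows a predicted distribution to be compared against a ground-truth MOS.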
