UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark

Zhaokun Zhou; Qiulin Wang; Bin Lin; Yiwei Su; Rui Chen; Xin Tao; Amin Zheng; Li Yuan; Pengfei Wan; Di Zhang

UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark

Zhaokun Zhou, Qiulin Wang, Bin Lin, Yiwei Su, Rui Chen, Xin Tao, Amin Zheng, Li Yuan, Pengfei Wan, Di Zhang

TL;DR

UNIAA tackles the fragmentation of image aesthetic assessment by unifying perception, description, and assessment through a vision-language multi-modal LLM. It introduces IDCP to cheaply convert existing IAA datasets into instruction-tuning data and launches UNIAA-Bench to evaluate aesthetic capabilities across three dimensions. Empirical results show strong zero-shot perception and competitive or superior performance in perception and competitive gains in description and assessment when fine-tuning the vision encoder, while also highlighting remaining gaps to human-level aesthetics. The work provides a scalable blueprint for a universal aesthetic assistant and releases both the baseline model and benchmark to catalyze further research.

Abstract

As an alternative to expensive expert evaluation, Image Aesthetic Assessment (IAA) stands out as a crucial task in computer vision. However, traditional IAA methods are typically constrained to a single data source or task, restricting the universality and broader application. In this work, to better align with human aesthetics, we propose a Unified Multi-modal Image Aesthetic Assessment (UNIAA) framework, including a Multi-modal Large Language Model (MLLM) named UNIAA-LLaVA and a comprehensive benchmark named UNIAA-Bench. We choose MLLMs with both visual perception and language ability for IAA and establish a low-cost paradigm for transforming the existing datasets into unified and high-quality visual instruction tuning data, from which the UNIAA-LLaVA is trained. To further evaluate the IAA capability of MLLMs, we construct the UNIAA-Bench, which consists of three aesthetic levels: Perception, Description, and Assessment. Extensive experiments validate the effectiveness and rationality of UNIAA. UNIAA-LLaVA achieves competitive performance on all levels of UNIAA-Bench, compared with existing MLLMs. Specifically, our model performs better than GPT-4V in aesthetic perception and even approaches the junior-level human. We find MLLMs have great potential in IAA, yet there remains plenty of room for further improvement. The UNIAA-LLaVA and UNIAA-Bench will be released.

UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark

TL;DR

Abstract

Paper Structure (45 sections, 2 equations, 8 figures, 8 tables)

This paper contains 45 sections, 2 equations, 8 figures, 8 tables.

Introduction
Related Work
Image Aesthestic Assessment
Multi-modality Large Language Models
UNIAA-LLaVA
IAA Datasets Conversion Paradigm
Data Selection.
Data Filtration.
QA Generation.
IDCP Advantages Discussion.
Image Aesthetic Visual Instruction Tuning
UNIAA-Bench
Aesthetic Perception
Aesthetic Description
Aesthetic Assessment
...and 30 more sections

Figures (8)

Figure 1: The Unified Multi-modal Image Aesthetic Assessment Framework, containing a baseline (a) and a benchmark (b). The aesthetic perception performance of UNIAA-LLaVA and other MLLMs is shown in (c).
Figure 2: The IAA Datasets Conversion Paradigm for UNIAA-LLaVA.
Figure 3: The UNIAA-Bench overview. (a) UNIAA-QA contains 5354 Image-Question-Answer samples and (b) UNIAA-Describe contains 501 Image-Description samples. (c) For open-source MLLMs, Logits can be extracted to calculate the score.
Figure 4: Image examples of four datasets in Aesthetic Assessment. Below the images are the corresponding aesthetic scores.
Figure 5: Supervised training model of MOS prediction. F1 represents the features of the last layer of the Vision Encoder, while F2 represents the features of the second-to-last layer.
...and 3 more figures

UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark

TL;DR

Abstract

UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark

Authors

TL;DR

Abstract

Table of Contents

Figures (8)