Table of Contents
Fetching ...

UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark

Zhaokun Zhou, Qiulin Wang, Bin Lin, Yiwei Su, Rui Chen, Xin Tao, Amin Zheng, Li Yuan, Pengfei Wan, Di Zhang

TL;DR

UNIAA tackles the fragmentation of image aesthetic assessment by unifying perception, description, and assessment through a vision-language multi-modal LLM. It introduces IDCP to cheaply convert existing IAA datasets into instruction-tuning data and launches UNIAA-Bench to evaluate aesthetic capabilities across three dimensions. Empirical results show strong zero-shot perception and competitive or superior performance in perception and competitive gains in description and assessment when fine-tuning the vision encoder, while also highlighting remaining gaps to human-level aesthetics. The work provides a scalable blueprint for a universal aesthetic assistant and releases both the baseline model and benchmark to catalyze further research.

Abstract

As an alternative to expensive expert evaluation, Image Aesthetic Assessment (IAA) stands out as a crucial task in computer vision. However, traditional IAA methods are typically constrained to a single data source or task, restricting the universality and broader application. In this work, to better align with human aesthetics, we propose a Unified Multi-modal Image Aesthetic Assessment (UNIAA) framework, including a Multi-modal Large Language Model (MLLM) named UNIAA-LLaVA and a comprehensive benchmark named UNIAA-Bench. We choose MLLMs with both visual perception and language ability for IAA and establish a low-cost paradigm for transforming the existing datasets into unified and high-quality visual instruction tuning data, from which the UNIAA-LLaVA is trained. To further evaluate the IAA capability of MLLMs, we construct the UNIAA-Bench, which consists of three aesthetic levels: Perception, Description, and Assessment. Extensive experiments validate the effectiveness and rationality of UNIAA. UNIAA-LLaVA achieves competitive performance on all levels of UNIAA-Bench, compared with existing MLLMs. Specifically, our model performs better than GPT-4V in aesthetic perception and even approaches the junior-level human. We find MLLMs have great potential in IAA, yet there remains plenty of room for further improvement. The UNIAA-LLaVA and UNIAA-Bench will be released.

UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark

TL;DR

UNIAA tackles the fragmentation of image aesthetic assessment by unifying perception, description, and assessment through a vision-language multi-modal LLM. It introduces IDCP to cheaply convert existing IAA datasets into instruction-tuning data and launches UNIAA-Bench to evaluate aesthetic capabilities across three dimensions. Empirical results show strong zero-shot perception and competitive or superior performance in perception and competitive gains in description and assessment when fine-tuning the vision encoder, while also highlighting remaining gaps to human-level aesthetics. The work provides a scalable blueprint for a universal aesthetic assistant and releases both the baseline model and benchmark to catalyze further research.

Abstract

As an alternative to expensive expert evaluation, Image Aesthetic Assessment (IAA) stands out as a crucial task in computer vision. However, traditional IAA methods are typically constrained to a single data source or task, restricting the universality and broader application. In this work, to better align with human aesthetics, we propose a Unified Multi-modal Image Aesthetic Assessment (UNIAA) framework, including a Multi-modal Large Language Model (MLLM) named UNIAA-LLaVA and a comprehensive benchmark named UNIAA-Bench. We choose MLLMs with both visual perception and language ability for IAA and establish a low-cost paradigm for transforming the existing datasets into unified and high-quality visual instruction tuning data, from which the UNIAA-LLaVA is trained. To further evaluate the IAA capability of MLLMs, we construct the UNIAA-Bench, which consists of three aesthetic levels: Perception, Description, and Assessment. Extensive experiments validate the effectiveness and rationality of UNIAA. UNIAA-LLaVA achieves competitive performance on all levels of UNIAA-Bench, compared with existing MLLMs. Specifically, our model performs better than GPT-4V in aesthetic perception and even approaches the junior-level human. We find MLLMs have great potential in IAA, yet there remains plenty of room for further improvement. The UNIAA-LLaVA and UNIAA-Bench will be released.
Paper Structure (45 sections, 2 equations, 8 figures, 8 tables)

This paper contains 45 sections, 2 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: The Unified Multi-modal Image Aesthetic Assessment Framework, containing a baseline (a) and a benchmark (b). The aesthetic perception performance of UNIAA-LLaVA and other MLLMs is shown in (c).
  • Figure 2: The IAA Datasets Conversion Paradigm for UNIAA-LLaVA.
  • Figure 3: The UNIAA-Bench overview. (a) UNIAA-QA contains 5354 Image-Question-Answer samples and (b) UNIAA-Describe contains 501 Image-Description samples. (c) For open-source MLLMs, Logits can be extracted to calculate the score.
  • Figure 4: Image examples of four datasets in Aesthetic Assessment. Below the images are the corresponding aesthetic scores.
  • Figure 5: Supervised training model of MOS prediction. F1 represents the features of the last layer of the Vision Encoder, while F2 represents the features of the second-to-last layer.
  • ...and 3 more figures