ArtiMuse: Fine-Grained Image Aesthetics Assessment with Joint Scoring and Expert-Level Understanding

Shuo Cao, Nan Ma, Jiayang Li, Xiaohui Li, Lihao Shao, Kaiwen Zhu, Yu Zhou, Yuandong Pu, Jiarui Wu, Jiaquan Wang, Bo Qu, Wenhai Wang, Yu Qiao, Dajuin Yao, Yihao Liu

TL;DR

ArtiMuse addresses the need for both precise quantitative aesthetics scoring and expert-level textual interpretation by introducing a joint multi-modal LLM framework and a large, expert-annotated ArtiMuse-10K dataset. It deploys a novel Token As Score mechanism to map continuous scores into existing tokenizer tokens for robust, fine-grained scoring within an 8-attribute descriptive framework. The approach delivers state-of-the-art or near state-of-the-art results on multiple aesthetics benchmarks and demonstrates strong cross-dataset generalization, while providing rich, expert-level textual analyses of images. The work enables more reliable, interpretable aesthetic assessment across diverse image domains, including AIGC content. Limitations include the current lack of automated aesthetic enhancement recommendations, which are earmarked for future work.
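The Token As Score idea mentioned above can be illustrated with a minimal sketch. This is one plausible reading of "map continuous scores into existing tokenizer tokens", not the authors' implementation: the 0 to 100 score range, the 101 levels, the token ids, and the function names `score_to_token_id`, `token_id_to_score`, and `expected_score` are all assumptions made for illustration.

```python
import math

# Assumed setup (not from the paper): 101 score levels on a 0..100 scale,
# each tied to one token id that already exists in the tokenizer vocabulary.
# The ids below are made up for illustration.
SCORE_MIN, SCORE_MAX, NUM_LEVELS = 0.0, 100.0, 101
STEP = (SCORE_MAX - SCORE_MIN) / (NUM_LEVELS - 1)
SCORE_TOKEN_IDS = [5000 + i for i in range(NUM_LEVELS)]


def score_to_token_id(score):
    """Quantize a continuous score to the nearest level and return its token id."""
    clipped = min(max(score, SCORE_MIN), SCORE_MAX)
    level = int(math.floor((clipped - SCORE_MIN) / STEP + 0.5))
    return SCORE_TOKEN_IDS[level]


def token_id_to_score(token_id):
    """Map a score token id back to the score value of its level."""
    level = SCORE_TOKEN_IDS.index(token_id)
    return SCORE_MIN + level * STEP


def expected_score(logits):
    """Decode a continuous score as the softmax-weighted mean over score levels.

    `logits` holds one value per score token, in level order.
    """
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]  # numerically stable softmax
    total = sum(weights)
    return sum(
        (SCORE_MIN + i * STEP) * w / total for i, w in enumerate(weights)
    )
```

Under these assumptions, a training target of 72.4 would be encoded as the token for level 72, while at inference the probability mass spread over the score tokens could be turned back into a continuous value with `expected_score`. Reusing tokens that are already in the vocabulary means no new embeddings need to be introduced.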

Abstract

The rapid advancement of educational applications, artistic creation, and AI-generated content (AIGC) technologies has substantially increased practical requirements for comprehensive Image Aesthetics Assessment (IAA), particularly demanding methods capable of delivering both quantitative scoring and professional understanding. Multimodal Large Language Model (MLLM)-based IAA methods demonstrate stronger perceptual and generalization capabilities compared to traditional approaches, yet they suffer from modality bias (score-only or text-only) and lack fine-grained attribute decomposition, thereby failing to support further aesthetic assessment. In this paper, we present: (1) ArtiMuse, an innovative MLLM-based IAA model with Joint Scoring and Expert-Level Understanding capabilities; (2) ArtiMuse-10K, the first expert-curated image aesthetic dataset comprising 10,000 images spanning 5 main categories and 15 subcategories, each annotated by professional experts with an 8-dimensional attribute analysis and a holistic score. Both the model and dataset will be made public to advance the field.

Paper Structure

This paper contains 41 sections, 2 equations, 31 figures, 13 tables.

Figures (31)

  • Figure 1: ArtiMuse provides granular, expert-level textual understanding results for images across eight fine-grained aesthetic attributes. Additionally, it achieves precise image aesthetics scoring, significantly outperforming state-of-the-art models across multiple widely-used benchmarks.
  • Figure 2: In comparison with existing models, ArtiMuse outperforms them by simultaneously achieving both accurate evaluation and precise aesthetics scoring in multi-dimensional assessments.
  • Figure 3: Data examples in ArtiMuse-10K.
  • Figure 4: Composition of ArtiMuse-10K.
  • Figure 5: Overview of ArtiMuse. ArtiMuse encompasses a multi-stage pipeline spanning data collection & processing, annotation generation, and model training, systematically enhancing its text evaluation capabilities and score assessment proficiency across multiple dimensions.
  • ...and 26 more figures