Advancing Comprehensive Aesthetic Insight with Multi-Scale Text-Guided Self-Supervised Learning

Yuti Liu, Shice Liu, Junyuan Gao, Pengtao Jiang, Hao Zhang, Jinwei Chen, Bo Li

TL;DR

This work introduces CALM, a Comprehensive Aesthetic Large Language Model, to advance holistic image aesthetic assessment. It combines a visual encoder, a Multi-scale Feature Alignment Module (MFAM), and a large language model, trained with a text-guided self-supervised learning framework that leverages unlabeled data through attribute-based pseudo-labels and GPT-3.5-generated textual cues. Through a two-stage instruct-tuning process, CALM achieves state-of-the-art performance on aesthetic scoring, commenting, and personalized image aesthetic assessment, and demonstrates zero-shot capabilities in aesthetic suggesting as well as in-context learning for PIAA. The approach advances multi-task aesthetic understanding, offering practical capabilities for end-to-end aesthetic analysis and guidance, with CALM-E further boosting performance by incorporating expansive generic and aesthetic QA data.

Abstract

Image Aesthetic Assessment (IAA) is a vital and intricate task that entails analyzing and assessing an image's aesthetic values, and identifying its highlights and areas for improvement. Traditional IAA methods often concentrate on a single aesthetic task and suffer from a shortage of labeled data, which impairs in-depth aesthetic comprehension. Despite efforts to overcome this challenge by applying Multi-modal Large Language Models (MLLMs), such models remain underdeveloped for IAA purposes. To address this, we propose a comprehensive aesthetic MLLM capable of nuanced aesthetic insight. Central to our approach is an innovative multi-scale text-guided self-supervised learning technique. This technique features a multi-scale feature alignment module and capitalizes on a wealth of unlabeled data in a self-supervised manner to structurally and functionally enhance aesthetic ability. The empirical evidence indicates that, when accompanied by extensive instruct-tuning, our model sets a new state of the art across multiple tasks, including aesthetic scoring, aesthetic commenting, and personalized image aesthetic assessment. Remarkably, it also demonstrates zero-shot learning capabilities in the emerging task of aesthetic suggesting. Furthermore, for personalized image aesthetic assessment, we harness the potential of in-context learning and showcase its inherent advantages.

Paper Structure

This paper contains 13 sections, 6 figures, and 7 tables.

Figures (6)

  • Figure 1: The functional comparison of our proposed CALM and other IAA methods.
  • Figure 2: The proposed CALM includes a visual encoder, a multi-scale feature alignment module and a large language model.
  • Figure 3: Some instruction examples utilized throughout the entire training process and across various tasks.
  • Figure 4: The two-stage training procedure. The pre-training stage focuses solely on the MFAM, while the fine-tuning stage also refines the LLM. The datasets are: unlabeled images (U), generic image-text pairs (G), aesthetic image-comment pairs (C), and aesthetic image-score pairs (S).
  • Figure 5: Qualitative comparison of aesthetic commenting. The red comments are correct, while the green ones are wrong.
  • ...and 1 more figure
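
The two-stage procedure described in the Figure 4 caption can be sketched as a per-stage table of which modules receive updates. This is only an illustrative sketch, not the authors' training code: the identifiers `visual_encoder`, `mfam`, `llm`, `Stage`, and `trainable_flags` are hypothetical, and keeping the visual encoder frozen in both stages is an assumption, since the caption states only that pre-training focuses on the MFAM and that fine-tuning also refines the LLM.

```python
from dataclasses import dataclass

# Component names follow the Figure 2 caption; the identifiers are hypothetical.
MODULES = ("visual_encoder", "mfam", "llm")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # names of the modules updated during this stage


# From the Figure 4 caption: pre-training updates only the MFAM,
# and fine-tuning additionally updates the LLM.
# Assumption: the visual encoder stays frozen in both stages.
STAGES = (
    Stage("pretrain", frozenset({"mfam"})),
    Stage("finetune", frozenset({"mfam", "llm"})),
)


def trainable_flags(stage):
    """Map each module name to True if it is updated in the given stage."""
    return {module: module in stage.trainable for module in MODULES}


for stage in STAGES:
    print(stage.name, trainable_flags(stage))
```

In a real training loop, flags like these would typically decide which parameter groups are handed to the optimizer in each stage.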