HumanAesExpert: Advancing a Multi-Modality Foundation Model for Human Image Aesthetic Assessment

Zhichao Liao, Xiaokun Liu, Wenyu Qin, Qingyu Li, Qiulin Wang, Pengfei Wan, Di Zhang, Long Zeng, Pingfa Feng

TL;DR

This work tackles the absence of open resources for Human Image Aesthetic Assessment (HIAA) by introducing HumanBeauty, a large-scale dataset with 108k images including 50k with 12-dimensional annotations and 58k with overall scores, enabling both holistic and fine-grained evaluation. The authors propose HumanAesExpert, a Vision-Language Model that integrates an Expert head to capture sub-dimension relationships, alongside an LM head and a Regression head, with a MetaVoter to fuse outputs for robust final scores. Training uses a two-stage scheme that leverages both overall and 12-dimensional supervision, plus QA-derived rating-level supervision to exploit continuous signals. Experiments demonstrate state-of-the-art performance on both overall HIAA and fine-grained dimensions, with strong zero-shot results and interpretable qualitative analyses, and the authors publicly release the dataset, models, and code to advance HIAA research.
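The rating-level supervision mentioned above relies on turning numeric scores into rating-level text for QA pairs. The snippet below is only a rough sketch of that idea. The five level names, the equal-width bins, the [0, 1] score range, the prompt wording, and the function names `score_to_level` and `make_qa_pair` are all assumptions made for illustration, not details taken from the paper.

```python
# Hypothetical rating vocabulary; the paper's actual level names are not given here.
RATING_LEVELS = ("bad", "poor", "fair", "good", "excellent")


def score_to_level(score, lo=0.0, hi=1.0, levels=RATING_LEVELS):
    """Map a numeric score in [lo, hi] to a text level using equal-width bins (assumed)."""
    if not lo <= score <= hi:
        raise ValueError(f"score {score} is outside [{lo}, {hi}]")
    width = (hi - lo) / len(levels)
    # The top edge (score == hi) is clamped into the last bin.
    index = min(int((score - lo) / width), len(levels) - 1)
    return levels[index]


def make_qa_pair(score, dimension="overall aesthetic"):
    """Build one question/answer training pair; the wording is purely illustrative."""
    level = score_to_level(score)
    return {
        "question": f"How would you rate the {dimension} of this human image?",
        "answer": f"The {dimension} of this image is {level}.",
    }


if __name__ == "__main__":
    print(make_qa_pair(0.9, dimension="outfit"))
```

Discretizing into text levels lets the LM head be trained with ordinary next-token prediction, while the original numeric score remains available for heads that regress a continuous value.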

Abstract

Image Aesthetic Assessment (IAA) is a long-standing and challenging research task. However, its subset, Human Image Aesthetic Assessment (HIAA), has been scarcely explored. To bridge this research gap, our work pioneers a holistic implementation framework tailored for HIAA. Specifically, we introduce HumanBeauty, the first dataset purpose-built for HIAA, which comprises 108k high-quality human images with manual annotations. To achieve comprehensive and fine-grained HIAA, 50k human images are manually collected through a rigorous curation process and annotated using our trailblazing 12-dimensional aesthetic standard, while the remaining 58k with overall aesthetic labels are systematically filtered from public datasets. Based on the HumanBeauty database, we propose HumanAesExpert, a powerful Vision Language Model for aesthetic evaluation of human images. We innovatively design an Expert head to incorporate human knowledge of aesthetic sub-dimensions while jointly utilizing the Language Modeling (LM) and Regression heads. This approach empowers our model to achieve superior proficiency in both overall and fine-grained HIAA. Furthermore, we introduce a MetaVoter, which aggregates scores from all three heads, to effectively balance the capabilities of each head, thereby realizing improved assessment precision. Extensive experiments demonstrate that our HumanAesExpert models deliver significantly better performance in HIAA than other state-of-the-art models. Project webpage: https://humanaesexpert.github.io/HumanAesExpert/
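The abstract states only that the MetaVoter aggregates the scores of the three heads; it does not give the functional form. As a minimal sketch, assuming a softmax-weighted average over the LM, Regression, and Expert head scores, the aggregation could look as follows. The names `meta_vote` and `logits` are hypothetical, and the real MetaVoter may be a different (for example, learned nonlinear) combiner.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def meta_vote(lm_score, regression_score, expert_score, logits=(0.0, 0.0, 0.0)):
    """Fuse three head scores with softmax-normalized weights.

    `logits` stands in for parameters that would be learned in practice;
    equal logits reduce the vote to a plain average. This form is an assumption.
    """
    weights = softmax(list(logits))
    scores = (lm_score, regression_score, expert_score)
    return sum(w * s for w, s in zip(weights, scores))


if __name__ == "__main__":
    # Equal logits: simple mean of the three head outputs.
    print(meta_vote(0.6, 0.7, 0.8))
    # A larger third logit shifts the result toward the Expert head's score.
    print(meta_vote(0.6, 0.7, 0.8, logits=(0.0, 0.0, 2.0)))
```

Because the weights are normalized, the fused score always stays within the range spanned by the three head scores, which keeps the output on the same scale as the individual predictions.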

Paper Structure

This paper contains 12 sections, 7 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Our HumanAesExpert, compared to existing state-of-the-art methods, shows exceptional improvements. $\uparrow$ indicates that larger values are better, $\downarrow$ signifies the opposite.
  • Figure 2: HumanBeauty construction pipeline. First, we select six diverse open-source datasets as data sources and perform data filtering to build our HumanBeauty-58k. Additionally, we manually collect and annotate 50k human images across multiple dimensions to create our HumanBeauty-50k. Finally, we map all the scores into rating-level text to form QA pairs for training.
  • Figure 3: (a) The training paths for human images with only overall annotations and for those with 12-dimensional annotations are highlighted in purple and yellow, respectively. (b) The Expert head is a sparsely connected MLP, with each node being supervised.
  • Figure 4: Statistical Analysis and Train-Test Split.
  • Figure 5: The visualization results of our model, where values in parentheses "( )" indicate the ground-truth scores. Labels A to L denote, respectively, facial brightness, facial feature clarity, facial skin tone, facial structure, facial contour clarity, facial aesthetic, outfit, body shape, looks, general appearance aesthetic, environment, and overall aesthetic scores.
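The Figure 3 caption describes the Expert head as a sparsely connected MLP in which every node is supervised, and the Figure 5 caption names the 12 dimensions. A minimal sketch of that idea follows. The parent/child grouping in `PARENTS` is an assumption inferred from the dimension names, not something the captions state, and the linear nodes, uniform weights, and mean-squared-error loss are likewise illustrative choices.

```python
# Assumed hierarchy: each parent node connects only to its listed children.
PARENTS = {
    "facial aesthetic": [
        "facial brightness",
        "facial feature clarity",
        "facial skin tone",
        "facial structure",
        "facial contour clarity",
    ],
    "general appearance aesthetic": ["outfit", "body shape", "looks"],
    "overall aesthetic": [
        "facial aesthetic",
        "general appearance aesthetic",
        "environment",
    ],
}


def uniform_weights():
    """Placeholder for learned weights: each parent averages its children."""
    return {
        node: {child: 1.0 / len(children) for child in children}
        for node, children in PARENTS.items()
    }


def expert_forward(leaf_scores, weights):
    """Propagate leaf scores upward through the sparse connections.

    PARENTS is ordered so that every child is computed before its parent.
    """
    scores = dict(leaf_scores)
    for node, children in PARENTS.items():
        scores[node] = sum(weights[node][child] * scores[child] for child in children)
    return scores


def per_node_loss(predicted, target):
    """Mean squared error over every annotated node, so each node is supervised."""
    return sum((predicted[name] - target[name]) ** 2 for name in target) / len(target)
```

Restricting each node to its semantic children means every intermediate value is itself a named score that can be compared against an annotation, which is what makes per-node supervision possible.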