Can Vision Language Models Assess Graphic Design Aesthetics? A Benchmark, Evaluation, and Dataset Perspective

Arctanx An; Shizhao Sun; Danqing Huang; Mingxi Cheng; Yan Gao; Ji Li; Yu Qiao; Jiang Bian

TL;DR

This work systematically evaluates proprietary, open-source, and reasoning-augmented VLMs, revealing clear performance gaps against the nuanced demands of aesthetic assessment, and establishes the first systematic framework for aesthetic quality assessment in graphic design.

Abstract

Assessing the aesthetic quality of graphic design is central to visual communication, yet remains underexplored in vision language models (VLMs). We investigate whether VLMs can evaluate design aesthetics in ways comparable to humans. Prior work faces three key limitations: benchmarks restricted to narrow principles and coarse evaluation protocols, a lack of systematic VLM comparisons, and limited training data for model improvement. In this work, we introduce AesEval-Bench, a comprehensive benchmark spanning four dimensions, twelve indicators, and three fully quantifiable tasks: aesthetic judgment, region selection, and precise localization. Then, we systematically evaluate proprietary, open-source, and reasoning-augmented VLMs, revealing clear performance gaps against the nuanced demands of aesthetic assessment. Moreover, we construct a training dataset to fine-tune VLMs for this domain, leveraging human-guided VLM labeling to produce task labels at scale and indicator-grounded reasoning to tie abstract indicators to concrete design regions. Together, our work establishes the first systematic framework for aesthetic quality assessment in graphic design. Our code and dataset will be released at: https://github.com/arctanxarc/AesEval-Bench

Paper Structure (20 sections, 5 figures, 17 tables, 1 algorithm)

Introduction
Related Works
Benchmark Construction
Overview
Curation Pipeline
Evaluation Protocols
Training Data Construction
Experiment
Benchmarking VLMs on AesEval-Bench
Fine-tuning VLMs with AesEval-Train
Conclusion
Statement of LLM Usage
Prompts and Instructions
Additional Related Work
Data Source Showcase
...and 5 more sections

Figures (5)

Figure 1: Overview of AesEval-Bench. (A) The four dimensions and twelve indicators considered in the benchmark. Numbers inside the circles indicate how many designs are labeled as flawed for each indicator. (B) Example designs illustrating the indicators, with regions exhibiting aesthetic issues highlighted by red boxes. Detailed textual explanations of all indicators are provided in the Appendix. (C) The three tasks, along with example questions and their expected answers.
Figure 2: (A) Illustration of two key steps in training data construction. Human-guided VLM labeling enables scalable determination of whether designs exhibit aesthetic issues. Indicator-grounded reasoning generates reasoning paths that explicitly link abstract indicators to concrete design regions (represented as bbox coordinates). (B) Example highlighting the difference between non-reasoning models, generic reasoning models, and our indicator-grounded reasoning model.
Figure 3: Results for model variants using different input components.
Figure 4: Diverse design samples sourced from the Crello dataset. The collection demonstrates a wide spectrum of visual styles and structural layouts, including (a) minimalist typography-centric designs (e.g., "Corporate Charity"), (b) photography-driven fashion editorials featuring real human subjects, (c) vintage illustrations, (d) photorealistic tech mockups, (e) geometric abstract art, (f) textured artistic typography, and (g) cyberpunk-themed certificates. This visual evidence refutes the concern of stylistic homogeneity, confirming the dataset's robust coverage across design domains.
Figure 5: Representative samples from real-world flawed cases collected by professional designers. Unlike the synthetically perturbed benchmark, these designs were curated by professional designers to represent authentic aesthetic defects encountered in real-world workflows. The samples exhibit flaws such as (a) visual clutter and inconsistent orientation (e.g., "World AIDS Day"), (b) spatial imbalance and disconnected elements (e.g., "Startup Job Fair"), (c) grid-based alignment errors (e.g., "School Magazine"), and (d) typographic obstruction (e.g., "End Violence"). This dataset serves as a rigorous Out-Of-Distribution benchmark to evaluate model generalization beyond synthetic patterns.
