Can Vision Language Models Assess Graphic Design Aesthetics? A Benchmark, Evaluation, and Dataset Perspective

Arctanx An, Shizhao Sun, Danqing Huang, Mingxi Cheng, Yan Gao, Ji Li, Yu Qiao, Jiang Bian

TL;DR

This work systematically evaluates proprietary, open-source, and reasoning-augmented VLMs, revealing clear performance gaps against the nuanced demands of aesthetic assessment, and establishes the first systematic framework for aesthetic quality assessment in graphic design.

Abstract

Assessing the aesthetic quality of graphic design is central to visual communication, yet remains underexplored in vision language models (VLMs). We investigate whether VLMs can evaluate design aesthetics in ways comparable to humans. Prior work faces three key limitations: benchmarks restricted to narrow principles and coarse evaluation protocols, a lack of systematic VLM comparisons, and limited training data for model improvement. In this work, we introduce AesEval-Bench, a comprehensive benchmark spanning four dimensions, twelve indicators, and three fully quantifiable tasks: aesthetic judgment, region selection, and precise localization. Then, we systematically evaluate proprietary, open-source, and reasoning-augmented VLMs, revealing clear performance gaps against the nuanced demands of aesthetic assessment. Moreover, we construct a training dataset to fine-tune VLMs for this domain, leveraging human-guided VLM labeling to produce task labels at scale and indicator-grounded reasoning to tie abstract indicators to concrete design regions. Together, our work establishes the first systematic framework for aesthetic quality assessment in graphic design. Our code and dataset will be released at: https://github.com/arctanxarc/AesEval-Bench
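
To illustrate what "fully quantifiable" could mean for the three tasks, the following is a minimal sketch, not the benchmark's actual scoring code. The abstract does not state the metrics, so the choices here are assumptions: exact-match accuracy for aesthetic judgment and region selection, and intersection over union (IoU) for precise localization. The function names and the (x_min, y_min, x_max, y_max) box convention are likewise hypothetical.

```python
from typing import Sequence, Tuple

# Hypothetical box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of exact matches between predicted and expected answers.

    Assumed metric for tasks with discrete answers, such as a yes/no
    aesthetic judgment or a multiple-choice region selection.
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes.

    Assumed metric for localization, where the model outputs bbox
    coordinates for the region exhibiting the aesthetic issue.
    """
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, `accuracy(["yes", "no"], ["yes", "yes"])` scores one of two judgments as correct, and `iou` returns 0.0 for boxes that do not overlap and 1.0 for identical boxes.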

Paper Structure (20 sections, 5 figures, 17 tables, 1 algorithm)

Figures (5)

  • Figure 1: Overview of AesEval-Bench. (A) The four dimensions and twelve indicators considered in the benchmark. Numbers inside the circles indicate how many designs are labeled as flawed for each indicator. (B) Example designs illustrating the indicators, with regions exhibiting aesthetic issues highlighted by red boxes. Detailed textual explanations of all indicators are provided in the Appendix. (C) The three tasks, along with example questions and their expected answers.
  • Figure 2: (A) Illustration of two key steps in training data construction. Human-guided VLM labeling enables scalable determination of whether designs exhibit aesthetic issues. Indicator-grounded reasoning generates reasoning paths that explicitly link abstract indicators to concrete design regions (represented as bbox coordinates). (B) Example highlighting the difference between non-reasoning models, generic reasoning models, and our indicator-grounded reasoning model.
  • Figure 3: Results for model variants using different input components.
  • Figure 4: Diverse design samples sourced from the Crello dataset. The collection spans a wide range of visual styles and structural layouts, including (a) minimalist typography-centric designs (e.g., "Corporate Charity"), (b) photography-driven fashion editorials featuring real human subjects, (c) vintage illustrations, (d) photorealistic tech mockups, (e) geometric abstract art, (f) textured artistic typography, and (g) cyberpunk-themed certificates. These samples counter the concern of stylistic homogeneity and illustrate the dataset's coverage across design domains.
  • Figure 5: Representative samples of real-world flawed designs collected by professional designers. Unlike the synthetically perturbed benchmark, these designs represent authentic aesthetic defects encountered in real-world workflows. The samples exhibit flaws such as (a) visual clutter and inconsistent orientation (e.g., "World AIDS Day"), (b) spatial imbalance and disconnected elements (e.g., "Startup Job Fair"), (c) grid-based alignment errors (e.g., "School Magazine"), and (d) typographic obstruction (e.g., "End Violence"). This dataset serves as a rigorous out-of-distribution benchmark for evaluating model generalization beyond synthetic patterns.