
Graphic-Design-Bench: A Comprehensive Benchmark for Evaluating AI on Graphic Design Tasks

Adrienne Deganutti, Elad Hirsch, Haonan Zhu, Jaejung Seol, Purvanshi Mehta

Abstract

We introduce GraphicDesignBench (GDB), the first comprehensive benchmark suite designed specifically to evaluate AI models on the full breadth of professional graphic design tasks. Unlike existing benchmarks that focus on natural-image understanding or generic text-to-image synthesis, GDB targets the unique challenges of professional design work: translating communicative intent into structured layouts, rendering typographically faithful text, manipulating layered compositions, producing valid vector graphics, and reasoning about animation. The suite comprises 50 tasks organized along five axes: layout, typography, infographics, template & design semantics, and animation, each evaluated under both understanding and generation settings, and grounded in real-world design templates drawn from the LICA layered-composition dataset. We evaluate a set of frontier closed-source models using a standardized metric taxonomy covering spatial accuracy, perceptual quality, text fidelity, semantic alignment, and structural validity. Our results reveal that current models fall short on the core challenges of professional design: spatial reasoning over complex layouts, faithful vector code generation, fine-grained typographic perception, and temporal decomposition of animations remain largely unsolved. While high-level semantic understanding is within reach, the gap widens sharply as tasks demand precision, structure, and compositional awareness. GDB provides a rigorous, reproducible testbed for tracking progress toward AI systems that can function as capable design collaborators. The full evaluation framework is publicly available.


Paper Structure

This paper contains 62 sections, 26 figures, and 66 tables.

Figures (26)

  • Figure 1: LICA samples [hirsch2026lica]. Design layouts with structured, component-level annotations capturing full hierarchy and rich metadata beyond coarse bounding boxes, on which we benchmark models in this report.
  • Figure 2: Example design templates illustrating the variance of layout properties evaluated in this section. The layouts differ in aspect ratio, number and type of components, spatial composition, and visual complexity (a), and may contain rotated image elements (b) or images placed inside decorative frames with non-rectangular crops (c).
  • Figure 3: Representative failure cases for layout understanding tasks. Models exhibit systematic errors including ratio confusion (a), order-of-magnitude overcounting (b), and type collapse (c).
  • Figure 4: Layer order prediction failure case (GPT-5.4). The model predicts an incorrect z-order that buries foreground elements (text and decorative shape) behind the background, illustrating that even small rank errors can render a layout unusable; a minimal scoring sketch for such rank errors follows this figure list.
  • Figure 5: Partial layout completion visualization. From left to right, we show the composite input image, the multiple target assets with a checkerboard background, and the predictions from three models: Gemini, Opus, and GPT. Both rows illustrate the multiple-placement setting. In the first row, the models fail to account for the cropped nature of the object asset, leading to visually unnatural placements of truncated content, as indicated by the orange arrows. In the second row, the models fail to preserve the shared visual concept of related text elements such as “Jobs” and “About,” placing them too far apart or in the wrong order, and in some cases cropping or misplacing other shapes (GPT-5.4).
  • ...and 21 more figures
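
The benchmark's official scoring protocol is not reproduced here. As a minimal sketch of why "even small rank errors" matter for layer order prediction (Figure 4), the snippet below scores a predicted z-order against the ground truth with a normalized Kendall's tau; the function name, layer IDs, and rescaling are illustrative assumptions, not GDB's metric.

# Minimal sketch of one way to score z-order predictions such as those in
# Figure 4. NOT the paper's official metric: it assumes the predicted and
# ground-truth orders are permutations of the same layer IDs, listed
# bottom-to-top, and scores them with a Kendall's tau rescaled to [0, 1].

from itertools import combinations

def kendall_tau_score(pred_order: list[str], gt_order: list[str]) -> float:
    """Return 1.0 for a perfect z-order, 0.0 for a fully reversed one."""
    assert set(pred_order) == set(gt_order), "orders must cover the same layers"
    gt_rank = {layer: i for i, layer in enumerate(gt_order)}
    pred_rank = {layer: i for i, layer in enumerate(pred_order)}
    concordant = discordant = 0
    for a, b in combinations(gt_order, 2):
        # A pair is concordant if the prediction orders it the same way
        # as the ground truth, discordant otherwise.
        if (gt_rank[a] - gt_rank[b]) * (pred_rank[a] - pred_rank[b]) > 0:
            concordant += 1
        else:
            discordant += 1
    tau = (concordant - discordant) / (concordant + discordant)  # in [-1, 1]
    return (tau + 1) / 2  # rescale to [0, 1]

# Hypothetical layers, bottom-to-top. Burying the foreground headline behind
# the background (as in Figure 4) discords every pair involving that layer.
gt = ["background", "photo", "shape", "headline"]
pred = ["headline", "background", "photo", "shape"]  # foreground text buried
print(kendall_tau_score(pred, gt))  # 0.5

Under this kind of scoring, a single swapped adjacent pair costs one discordant pair, while misplacing one layer across the whole stack discords many pairs at once, matching the observation that such errors render a layout unusable.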