
DesignProbe: A Graphic Design Benchmark for Multimodal Large Language Models

Jieru Lin, Danqing Huang, Tiejun Zhao, Dechen Zhan, Chin-Yew Lin

TL;DR

DesignProbe addresses the gap in evaluating graphic design understanding by multimodal LLMs. It proposes an eight-task benchmark spanning design element recognition and semantic tasks as well as overall design perception, evaluated with GPT-4 as an automatic grader. The study demonstrates that prompt refinement and image-based knowledge augmentation improve performance, with GPT-4 Vision achieving the strongest results but still not passing a 60% threshold. This benchmark provides a domain-specific, scalable evaluation tool and insights that guide future research on design-aware AI systems.

Abstract

A well-executed graphic design typically achieves harmony at two levels, from the fine-grained design elements (color, font, and layout) to the overall design. This complexity makes graphic design challenging to comprehend, since it requires the ability both to recognize the design elements and to understand the design. With the rapid development of Multimodal Large Language Models (MLLMs), we establish DesignProbe, a benchmark to investigate the capability of MLLMs in design. Our benchmark includes eight tasks in total, across both the fine-grained element level and the overall design level. At the design element level, we consider both attribute recognition and semantic understanding tasks. At the overall design level, we include style and metaphor. Nine MLLMs are tested, and we use GPT-4 as the evaluator. Further experiments indicate that refining prompts can enhance the performance of MLLMs. We first have different LLMs rewrite the prompts and find that performance increases for models whose prompts were refined by their own LLM. We then add extra task knowledge in two different ways (text descriptions and image examples) and find that adding images boosts performance much more than adding text.

Paper Structure (15 sections, 5 figures, 3 tables)

This paper contains 15 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The performance of 5 MLLMs at overall design level and design element level (color, font and layout) in DesignProbe.
  • Figure 2: Overview of our benchmark. It comprises a total of eight tasks to evaluate the proficiency of MLLMs in design. The assessment occurs on two distinct levels: the element level and the overall design level. At the element level, it focuses on three fundamental design components: color, font, and layout. For each, both visual and semantic aspects are included. Each task is presented with an example.
  • Figure 3: Examples of adding an example to the prompt.
  • Figure 4: Experimental results (%) of adding different types of information to the questions. Ori (green) is the performance on the original questions in DesignProbe. + text (yellow) adds a text description to the questions. + concated image (pink) combines multiple image examples into one image, because LLaVA does not support multi-image input. + image adds multiple image examples.
  • Figure 5: Error cases of overall design level tasks. In case 1, the model fails to recognize the creative use of a record. In case 2, the model fails to recognize the abstract representation of theater seats in a car.