BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation

Yan Li, Zezi Zeng, Ziwei Zhou, Xin Gao, Muzhao Tian, Yifan Yang, Mingxi Cheng, Qi Dai, Yuqing Yang, Lili Qiu, Zhendong Wang, Zhengyuan Yang, Xue Yang, Lijuan Wang, Ji Li, Chong Luo

Abstract

Recent advances in image generation models have expanded their applications beyond aesthetic imagery toward practical visual content creation. However, existing benchmarks focus mainly on natural image synthesis and fail to systematically evaluate models under the structured, multi-constraint requirements of real-world commercial design tasks. In this work, we introduce BizGenEval, a systematic benchmark for commercial visual content generation. The benchmark spans five representative document types (slides, charts, webpages, posters, and scientific figures) and evaluates four key capability dimensions (text rendering, layout control, attribute binding, and knowledge-based reasoning), yielding 20 diverse evaluation tasks. BizGenEval contains 400 carefully curated prompts and 8,000 human-verified checklist questions that rigorously assess whether generated images satisfy complex visual and semantic constraints. We benchmark 26 popular image generation systems at scale, including state-of-the-art commercial APIs and leading open-source models. The results reveal substantial gaps between the capabilities of current generative models and the requirements of professional visual content creation. We hope BizGenEval serves as a standardized benchmark for real-world commercial visual content generation.
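To make the checklist protocol concrete, the sketch below scores one generated image against its yes/no checklist and reports pass rates overall and per capability dimension. This is a minimal illustration under assumed interfaces, not the benchmark's released code: ChecklistItem, checklist_score, and the judge callable (e.g., a vision-language model answering each question) are hypothetical names.

    from dataclasses import dataclass
    from typing import Callable, Dict, List, Tuple

    @dataclass
    class ChecklistItem:
        # One human-verified yes/no question about a generated image (hypothetical structure).
        question: str   # e.g. "Is the poster title rendered verbatim?"
        dimension: str  # e.g. "text_rendering", "layout_control", "attribute_binding", "knowledge_reasoning"

    def checklist_score(
        image_path: str,
        checklist: List[ChecklistItem],
        judge: Callable[[str, str], bool],  # (image_path, question) -> pass/fail, e.g. a VLM judge
    ) -> Dict[str, float]:
        # Fraction of checklist questions passed, overall and per capability dimension.
        per_dim: Dict[str, Tuple[int, int]] = {}
        passed = 0
        for item in checklist:
            ok = judge(image_path, item.question)
            passed += int(ok)
            hits, total = per_dim.get(item.dimension, (0, 0))
            per_dim[item.dimension] = (hits + int(ok), total + 1)
        scores = {dim: hits / total for dim, (hits, total) in per_dim.items()}
        scores["overall"] = passed / len(checklist)
        return scores

Aggregating such per-image scores across the 400 prompts would yield the kind of per-dimension comparison the benchmark reports.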

Paper Structure

This paper contains 19 sections, 21 figures, and 5 tables.

Figures (21)

  • Figure 1: Real-world samples of BizGenEval from 5 content domains and 4 capability dimensions. For each capability dimension, different subtasks are highlighted with blue boxes. Each example covers distinct aspects, with the relevant regions highlighted in red and the corresponding simplified questions listed below.
  • Figure 2: Overview of the construction and evaluation pipeline. The system converts real-world references and domain knowledge into structured prompts, and evaluates generated images against rigorous checklists to provide a comprehensive analysis.
  • Figure 3: Dataset statistics of BizGenEval. (a) Prompt token length distribution by evaluation dimension, with scatter points color-coded by document type. (b) Hierarchical subcategory distribution across evaluation dimensions: the inner ring shows question proportions per dimension; the outer ring shows subcategory breakdowns. (c) Keyword clouds of checklist questions per evaluation dimension.
  • Figure 4: Qualitative evaluation of different commercial image generation models across five content domains. Columns represent content domains, and rows show model outputs. Evaluation questions are listed in the top row, with correct and incorrect regions highlighted in blue and red boxes.
  • Figure 5: Qualitative evaluation of different commercial image generation models across four capability dimensions.
  • ...and 16 more figures