PosterReward: Unlocking Accurate Evaluation for High-Quality Graphic Design Generation

Jianyu Lai, Sixiang Chen, Jialin Gao, Hengyu Shi, Zhongying Liu, Fuxiang Zhai, Junfeng Luo, Xiaoming Wei, Lujia Wang, Lei Zhu

Abstract

Recent advancements in the text-rendering capabilities of image generation models have made the end-to-end creation of graphic design content, such as posters, increasingly feasible. However, existing reward models fall short of accurately assessing design quality, as they primarily focus on global image aesthetics while overlooking the critical dimensions of typography and layout. Furthermore, the scarcity of domain-specific preference data remains a significant bottleneck, which limits the further development of graphic design evaluation and generation. To bridge this gap, we introduce an automated pipeline to construct a high-quality dataset of 70k poster preferences by leveraging the consensus of multiple Multi-modal Large Language Models (MLLMs) to simulate human-like judgment. Utilizing this dataset, we develop PosterReward, a reward model specifically designed for high-precision poster assessment through a cascaded, multi-stage training strategy. We also provide multiple variants of the model to cater to different application scenarios. Finally, we introduce PosterRewardBench and PosterBench to evaluate the performance of existing reward models in poster assessment and the generation capabilities of current text-to-image models in poster creation, respectively.
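The abstract describes labeling preference pairs via the consensus of multiple MLLM judges. A minimal sketch of such consensus aggregation is below; the function name, the `'A'`/`'B'` choice encoding, and the 0.75 agreement threshold are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter

def consensus_preference(judgments):
    """Aggregate per-judge choices ('A' or 'B') into one consensus label.

    judgments: list of choices, one per MLLM judge.
    Returns the majority choice when agreement meets an assumed 0.75
    threshold; otherwise None, i.e. the ambiguous pair is discarded.
    """
    counts = Counter(judgments)
    choice, votes = counts.most_common(1)[0]
    if votes / len(judgments) >= 0.75:  # assumed agreement threshold
        return choice
    return None
```

For example, `consensus_preference(["A", "A", "A", "B"])` yields `"A"`, while an evenly split vote yields `None` so the pair never enters the training set. Filtering out low-agreement pairs is one common way such pipelines trade dataset size for label quality.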

Paper Structure

This paper contains 28 sections, 10 equations, 21 figures, and 7 tables.

Figures (21)

  • Figure 1: PosterReward. PosterReward is a reward model for poster generation tasks. It evaluates posters from multiple dimensions and outputs scores, achieving an accurate assessment of graphic design quality.
  • Figure 2: Schematic diagram of AI preference data collection. The raw data was generated using Seedream 3.0, Seedream 4.0, and Qwen-Image-Lightning. The models used included four open-source models: CLIP, DINOv3, HPSv3, and GLM-4.5v, and three closed-source models: Gemini-2.5-Flash-Lite, Gemini-2.5-Pro, and GPT-5.
  • Figure 3: A schematic diagram of AI preference data samples. In each group of images, the left side represents chosen samples, and the right side represents rejected samples. The orange box at the end of the prompt section below indicates the dimensions used to construct the preference pairs.
  • Figure 4: PosterReward training pipeline and model structure diagram. The top shows three reward models with different structures, and the bottom shows the training pipeline. Our training pipeline consists of four cascaded stages: Joint Supervised Fine-Tuning, Joint Rejection Sampling, Score-Module Training, and Reinforcement Learning.
  • Figure 5: Preference analysis of the analysis module using the MLLM-as-a-judge method with Gemini-3-flash. The results show that the SFT model performs significantly better than the base model, with both joint SFT and joint RSFT contributing to performance gains. We averaged the results of two separate annotations with swapped text positions to eliminate position bias.
  • ...and 16 more figures