E-comIQ-ZH: A Human-Aligned Dataset and Benchmark for Fine-Grained Evaluation of E-commerce Posters with Chain-of-Thought

Meiqi Sun, Mingyu Li, Junxiong Zhu

TL;DR

This paper introduces E-comIQ-ZH, a framework for evaluating Chinese e-commerce posters, and builds E-comIQ-18k, the first dataset to feature multi-dimensional scores and expert-calibrated Chain-of-Thought (CoT) rationales, to enable scalable automated assessment of e-commerce posters.

Abstract

Generative AI is widely used to create commercial posters. However, rapid advances in generation have outpaced automated quality assessment. Existing models emphasize generic esthetics or low-level distortions and lack the functional criteria required for e-commerce design. Assessment is especially challenging for Chinese content, where complex characters often produce subtle but critical textual artifacts that existing methods overlook. To address this, we introduce E-comIQ-ZH, a framework for evaluating Chinese e-commerce posters. We build E-comIQ-18k, the first dataset to feature multi-dimensional scores and expert-calibrated Chain-of-Thought (CoT) rationales. Using this dataset, we train E-comIQ-M, a specialized evaluation model that aligns with human expert judgment. Our framework enables E-comIQ-Bench, the first automated and scalable benchmark for the generation of Chinese e-commerce posters. Extensive experiments show that E-comIQ-M aligns more closely with expert standards and enables scalable automated assessment of e-commerce posters. All datasets, models, and evaluation tools will be released to support future research in this area. Code will be available at https://github.com/4mm7/E-comIQ-ZH.

Paper Structure

This paper contains 49 sections, 2 equations, 21 figures, 12 tables, 1 algorithm.

Figures (21)

  • Figure 1: Qualitative comparison of E-comIQ-M with leading MLLMs on a challenging e-commerce image. While other powerful models such as Gemini 2.5 Pro [comanici2025gemini] and Q-Insight [li2025q] overlook critical flaws, our E-comIQ-M accurately identifies the subtle stroke-level corruption. This leads to a more human-aligned low score for the text dimension (1.0), demonstrating its superior fine-grained diagnostic capabilities.
  • Figure 2: Overview of the E-comIQ-ZH framework. (a–c) E-comIQ-Dataset: multi-dimensional expert annotations with Chain-of-Thought rationales. (d–e) E-comIQ-M: two-stage training via Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO). (f) E-comIQ-Bench: evaluation of generative models on e-commerce image generation capabilities.
  • Figure 3: An illustration of our human-AI collaborative pipeline for generating diagnostic Chain-of-Thought (CoT) rationales.
  • Figure 4: Distribution of image sources.
  • Figure 5: Statistical Profile of E-comIQ-18k. (a) The multi-modal distribution of overall scores highlights sample diversity. (b) The distribution of CoT rationale lengths. (c) The correlation matrix reveals a semi-orthogonal dimensional structure. (d) A 'weakest link' analysis pinpoints common diagnostic challenges.
  • ...and 16 more figures