CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions

Chonghuinan Wang, Zihan Chen, Yuxiang Wei, Tianyi Jiang, Xiaohe Wu, Fan Li, Wangmeng Zuo, Hongxun Yao

Abstract

Instruction-based multimodal image manipulation has recently made rapid progress. However, existing evaluation methods lack a systematic and human-aligned framework for assessing model performance on complex and creative editing tasks. To address this gap, we propose CREval, a fully automated question-answer (QA)-based evaluation pipeline that overcomes the incompleteness and poor interpretability of opaque Multimodal Large Language Model (MLLM) scoring. Alongside it, we introduce CREval-Bench, a comprehensive benchmark specifically designed for creative image manipulation under complex instructions. CREval-Bench covers three categories and nine creative dimensions, comprising over 800 editing samples and 13K evaluation queries. Leveraging this pipeline and benchmark, we systematically evaluate a diverse set of state-of-the-art open- and closed-source models. The results reveal that while closed-source models generally outperform open-source ones on complex and creative tasks, all models still struggle to complete such edits effectively. In addition, user studies demonstrate strong consistency between CREval's automated metrics and human judgments. Therefore, CREval provides a reliable foundation for evaluating image editing models on complex and creative image manipulation tasks, and highlights key challenges and opportunities for future research.

Paper Structure

This paper contains 18 sections, 1 equation, 19 figures, and 6 tables.

Figures (19)

  • Figure 1: Evaluation of state-of-the-art image generation and editing models using CREval, with GPT-4o serving as the evaluator. Each edited image is evaluated across three metrics: Instruction Following (IF), Visual Consistency (VC), and Visual Quality (VQ). The results indicate that the complex and creative instructions in CREval-Bench pose substantial challenges for current image manipulation models.
  • Figure 2: Comparison with previous benchmarks. The CREval-Bench dataset extends existing instruction-based editing benchmarks by incorporating more complex, creative, and semantically rich instructions. This design facilitates a comprehensive evaluation of model performance in handling imaginative and complex instruction editing tasks. In (b), the edited image examples on the right correspond one-to-one with the image-instruction pairs on the left.
  • Figure 3: Distribution of creative editing types. Creative types are organized into 3 primary categories and 9 dimensions, with balanced sample counts to ensure comprehensive and consistent evaluation.
  • Figure 4: Overview of CREval. (1) In stage 1, we manually select high-quality images. We then construct several editing instruction examples and utilize the GPT-4o model for few-shot learning across 9 predefined dimensions, generating dimension-consistent editing instructions and producing image–instruction pairs. (2) In stage 2, we use these image–instruction pairs to construct evaluation tasks. To reduce bias, we use different MLLMs, such as Qwen2.5-VL-72B, to generate evaluation questions for 3 metrics using the Chain-of-Thought (CoT) method. Each metric contains at least 5 questions, with a total of no fewer than 15 questions per pair, completing the construction of CREval-Bench. (3) In stage 3, we evaluate mainstream image manipulation models using the CREval method. An MLLM is employed as the evaluator to score each edited image based on the evaluation questions. The final performance metric is obtained by computing a weighted average score across all evaluation metrics (see the aggregation sketch after this list).
  • Figure 5: Performance comparison across all creative dimensions under different metrics. Top row: closed-source models; bottom row: open-source models.
  • ...and 14 more figures
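
The final step of the pipeline in Figure 4 aggregates per-question evaluator answers into a single score. Below is a minimal Python sketch of that aggregation, assuming binary yes/no answers per question and uniform weights across the three metrics (IF, VC, VQ); the paper's actual weighting scheme and answer format may differ, and all names here are hypothetical.

```python
# Minimal sketch of CREval-style score aggregation (hypothetical names;
# the exact weighting scheme is an assumption, not taken from the paper).
from typing import Dict, List

# Assumed uniform weights over Instruction Following, Visual Consistency,
# and Visual Quality.
METRIC_WEIGHTS = {"IF": 1 / 3, "VC": 1 / 3, "VQ": 1 / 3}

def metric_score(answers: List[bool]) -> float:
    """Fraction of evaluation questions the MLLM evaluator answered 'yes'."""
    return sum(answers) / len(answers) if answers else 0.0

def creval_score(qa_results: Dict[str, List[bool]]) -> float:
    """Weighted average of per-metric scores for one edited image."""
    return sum(METRIC_WEIGHTS[m] * metric_score(a) for m, a in qa_results.items())

# Example: one edited image with at least 5 questions per metric,
# as in CREval-Bench.
results = {
    "IF": [True, True, False, True, True],
    "VC": [True, False, True, True, True],
    "VQ": [True, True, True, False, True],
}
print(f"CREval score: {creval_score(results):.3f}")
```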