CreBench: Human-Aligned Creativity Evaluation from Idea to Process to Product

Kaiwen Xue, Chenglong Li, Zhonghong Ou, Guoxin Zhang, Kaoyan Lu, Shuai Lyu, Yifan Zhu, Ping Zong, Junpeng Ding, Xinyu Liu, Qunlin Chen, Weiwei Qin, Yiran Shen, Jiayi Cen

TL;DR

CreBench introduces a human-aligned, multi-dimensional creativity benchmark for multimodal models, addressing the abstract nature of human creativity across idea, process, and product. It pairs CreMIT, a large-scale multimodal instruction-tuning dataset generated from expert evaluations and GPT-4o prompts, with CreExpert, a fine-tuned open-source MLLM that aligns more closely with human creativity judgments than state-of-the-art closed and open models. The approach demonstrates that task- and dimension-specific instruction tuning can significantly improve creativity assessment in multimodal models, as shown by extensive ablations across idea, process, and product evaluations. By releasing CreBench, CreMIT, and CreExpert as open resources, the work provides a practical foundation for building and benchmarking human-aligned creativity evaluation in real-world, open-ended tasks.

Abstract

Human-defined creativity is highly abstract, which makes it challenging for multimodal large language models (MLLMs) to comprehend and assess creativity in a way that aligns with human judgments. The absence of an existing benchmark further exacerbates this problem. To this end, we propose CreBench, which consists of two key components: 1) an evaluation benchmark covering multiple dimensions, from creative idea to process to product; 2) CreMIT (Creativity Multimodal Instruction Tuning dataset), a multimodal creativity evaluation dataset consisting of 2.2K multimodal samples from diverse sources, 79.2K pieces of human feedback, and 4.7M instructions of multiple types. Specifically, to ensure MLLMs can handle diverse creativity-related queries, we prompt GPT to refine the human feedback so that it activates stronger creativity assessment capabilities. CreBench serves as a foundation for building MLLMs that understand human-aligned creativity. Building on CreBench, we fine-tune open-source general MLLMs, resulting in CreExpert, a multimodal creativity evaluation expert model. Extensive experiments demonstrate that the proposed CreExpert models achieve significantly better alignment with human creativity evaluation than state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision.

Paper Structure

This paper contains 35 sections, 3 figures, 9 tables.

Figures (3)

  • Figure 1: Overview of CreBench. (a) We design a diverse set of creative tasks and build a multi-dimensional database. (b) We use GPT-4o to generate instruction-following data through prompting. (c) Performance of the proposed CreExpert on various creativity evaluation dimensions. (*) indicates data scaled for better visibility.
  • Figure 2: Overview of the CreMIT construction procedure. Stage 1: We collect diverse solutions (creative idea, creative process, and creative product) generated by students and AI for open-ended creativity tasks. Stage 2: Innovation experts evaluate each solution across 12 indicators spanning three dimensions, producing detailed assessment reports. Stage 3: Using six types of prompts, we employ GPT-4o to generate multidimensional instruction-following data based on the expert feedback.
  • Figure 3: Data distribution and visualization. (a) We analyze the number of samples for each task and the number of prompts in each prompt category. (b) We show only the first sample of the idea prompt (among the idea, process, and product prompts).