CreBench: Human-Aligned Creativity Evaluation from Idea to Process to Product
Kaiwen Xue, Chenglong Li, Zhonghong Ou, Guoxin Zhang, Kaoyan Lu, Shuai Lyu, Yifan Zhu, Ping Zong, Junpeng Ding, Xinyu Liu, Qunlin Chen, Weiwei Qin, Yiran Shen, Jiayi Cen
TL;DR
CreBench is a human-aligned, multi-dimensional creativity benchmark for multimodal models. It addresses the difficulty of assessing abstract, human-defined creativity across idea, process, and product. It pairs CreMIT, a large-scale multimodal instruction-tuning dataset generated from expert evaluations and GPT-4o prompts, with CreExpert, a fine-tuned open-source MLLM that aligns more closely with human creativity judgments than state-of-the-art closed and open models. Extensive ablations across idea, process, and product evaluations show that task- and dimension-specific instruction tuning can significantly improve the creativity assessment capabilities of multimodal models. By releasing CreBench, CreMIT, and CreExpert as open resources, the work provides a practical foundation for building and benchmarking human-aligned creativity evaluation in real-world, open-ended tasks.
Abstract
Human-defined creativity is highly abstract, which makes it difficult for multimodal large language models (MLLMs) to comprehend and assess creativity in a way that aligns with human judgments. The lack of a dedicated benchmark compounds the problem. To this end, we propose CreBench, which consists of two key components: 1) an evaluation benchmark covering multiple dimensions, from creative idea to process to product; 2) CreMIT (Creativity Multimodal Instruction Tuning dataset), a multimodal creativity evaluation dataset consisting of 2.2K multimodal samples from diverse sources, 79.2K pieces of human feedback, and 4.7M instructions of multiple types. Specifically, to ensure that MLLMs can handle diverse creativity-related queries, we prompt GPT to refine the human feedback, which elicits stronger creativity assessment capabilities. CreBench serves as a foundation for building MLLMs that understand human-aligned creativity. Building on CreBench, we fine-tune open-source general MLLMs to obtain CreExpert, a multimodal creativity evaluation expert model. Extensive experiments demonstrate that the proposed CreExpert models achieve significantly better alignment with human creativity evaluation than state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision.
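To make "alignment with human creativity evaluation" concrete, the sketch below computes a rank correlation between model-assigned and human-assigned creativity scores. The abstract does not state which alignment metric is used, so the choice of Spearman correlation here is an assumption, and the function names and example scores are hypothetical illustrations rather than part of CreBench.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(model_scores, human_scores):
    """Spearman rank correlation (assumed metric, not confirmed by the paper)."""
    if len(model_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    return pearson(average_ranks(model_scores), average_ranks(human_scores))


if __name__ == "__main__":
    # Hypothetical scores for five creative works on one dimension.
    human = [4.5, 2.0, 3.5, 1.0, 5.0]
    model = [4.0, 2.5, 3.0, 1.5, 4.8]
    print(f"Spearman correlation: {spearman(model, human):.3f}")
```

A value near 1 would indicate that the model ranks works in nearly the same order as human raters; computing it separately per dimension (idea, process, product) would mirror the benchmark's multi-dimensional structure.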
