Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM

Xinyu Fang, Zhijian Chen, Kai Lan, Lixin Ma, Shengyuan Ding, Yingji Liang, Xiangyu Zhao, Farong Wen, Zicheng Zhang, Guofeng Zhang, Haodong Duan, Kai Chen, Dahua Lin

TL;DR

Creation-MMBench introduces a dedicated multimodal benchmark to evaluate context-aware creative intelligence in MLLMs, addressing a gap in visual creativity assessment. It combines 765 test cases across 51 tasks with instance-specific criteria and an MLLM-as-a-judge framework using GPT-4o, plus a text-only variant Creation-MMBench-TO. Empirical results show open-source MLLMs lag behind proprietary models in creative tasks and highlight a potential trade-off where visual instruction tuning can harm creativity. The benchmark provides a practical, scalable platform to study multimodal generative creativity and guide future improvements.

Abstract

Creativity is a fundamental aspect of intelligence, involving the ability to generate novel and appropriate solutions across diverse contexts. While Large Language Models (LLMs) have been extensively evaluated for their creative capabilities, the assessment of Multimodal Large Language Models (MLLMs) in this domain remains largely unexplored. To address this gap, we introduce Creation-MMBench, a multimodal benchmark specifically designed to evaluate the creative capabilities of MLLMs in real-world, image-based tasks. The benchmark comprises 765 test cases spanning 51 fine-grained tasks. To ensure rigorous evaluation, we define instance-specific evaluation criteria for each test case, guiding the assessment of both general response quality and factual consistency with visual inputs. Experimental results reveal that current open-source MLLMs significantly underperform compared to proprietary models in creative tasks. Furthermore, our analysis demonstrates that visual fine-tuning can negatively impact the base LLM's creative abilities. Creation-MMBench provides valuable insights for advancing MLLM creativity and establishes a foundation for future improvements in multimodal generative intelligence. The full data and evaluation code are released at https://github.com/open-compass/Creation-MMBench.
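To make the idea of instance-specific criteria concrete, the following is a minimal sketch of how a single test case and its judge prompt might be represented. It is not the authors' implementation: the class name `CreativeCase`, its field names, the prompt wording, and the example values are all hypothetical, and the released repository may organize things quite differently. The only details taken from the text above are that each test case carries its own criteria and that those criteria cover both general response quality and factual consistency with the image.

```python
from dataclasses import dataclass


@dataclass
class CreativeCase:
    """One benchmark instance. All field names are hypothetical."""
    task: str               # one of the fine-grained tasks
    image_path: str         # image the query refers to
    query: str              # creative instruction given to the model
    general_criteria: str   # instance-specific rubric for overall response quality
    visual_criteria: str    # instance-specific rubric for consistency with the image


def build_judge_prompt(case: CreativeCase, response: str) -> str:
    """Assemble a text prompt for a judge model from one case and one response.

    The wording is illustrative only; it does not reproduce the paper's prompt.
    """
    return "\n".join([
        f"Task: {case.task}",
        f"Query: {case.query}",
        f"Response to evaluate: {response}",
        f"General quality criteria: {case.general_criteria}",
        f"Visual factuality criteria: {case.visual_criteria}",
        "Assess the response against each set of criteria separately.",
    ])


# Hypothetical example values, not drawn from the benchmark data.
example = CreativeCase(
    task="story continuation",
    image_path="images/example.jpg",
    query="Continue the story suggested by this picture.",
    general_criteria="The story is coherent and imaginative.",
    visual_criteria="Characters and setting match what the image shows.",
)
print(build_judge_prompt(example, "Once upon a time..."))
```

Keeping the two rubrics as separate fields mirrors the abstract's distinction between general response quality and visual factual consistency, so a judge can report on each independently.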

Paper Structure

This paper contains 12 sections, 2 equations, 33 figures, 7 tables.

Figures (33)

  • Figure 1: Brain regions related to creativity and their respective functions (heilman2016possible; gao2021subcortical).
  • Figure 2: Overview of Creation-MMBench. The benchmark contains four task categories, each consisting of multiple tasks, and covers diverse image types. Only a few representative tasks from each category are shown here; the complete list of tasks is given in Appendix A.
  • Figure 3: Evaluation results of MLLMs without visual input.
  • Figure 4: Statistics and Cases of Creation-MMBench. Compared to other widely used MLLM benchmarks, Creation-MMBench features a more comprehensive query design to capture abundant creative contexts. Diverse roles are introduced into the queries to stimulate MLLMs' utilization of disciplinary and prior knowledge. As an MLLM benchmark, Creation-MMBench includes a rich variety of images to thoroughly evaluate multiple capabilities of MLLMs.
  • Figure 5: Comparing OC Score and Creation-MMBench Reward. This figure shows the model performance on the OpenVLM Leaderboard and Creation-MMBench, highlighting a significant gap between objective performance and visual creativity in some open-source models.
  • ...and 28 more figures