What Shapes a Creative Machine Mind? Comprehensively Benchmarking Creativity in Foundation Models

Zicong He, Boxuan Zhang, Weihao Liu, Ruixiang Tang, Lu Cheng

TL;DR

The paper tackles the problem of evaluating creativity in foundation models, arguing that accuracy alone fails to capture innovation. It introduces C^2-Eval, a benchmark that separately assesses convergent (structured, constrained) and divergent (open-ended) creativity with a shared Usefulness ($U$), Originality ($O$), and Surprise ($S$) rubric, combined into a composite score $C = (U+O+S)/3$. The authors benchmark over 20 proprietary and open-source models across text and vision-language tasks, finding that creativity does not scale monotonically with model size and that reasoning capabilities and carefully designed creative instructions can substantially boost creative outputs. They also explore the relationship between convergent and divergent creativity, which shows nuanced trade-offs and task-dependent profiles, and they provide a reproducible evaluation pipeline to guide future research in creative AI. The work is of practical use for designing, evaluating, and steering the development of creative foundation models in a responsible and interpretable way.
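The composite score mentioned above is an unweighted mean of the three rubric components. A minimal sketch of that arithmetic follows; the `UOSScore` class, the `mean_composite` helper, and the assumption that each component is already on a common scale (here 0 to 1) are illustrative and not taken from the paper's pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UOSScore:
    """Hypothetical container for one response's rubric scores.

    Assumes U, O, and S are already on the same scale (e.g. 0 to 1).
    """
    usefulness: float
    originality: float
    surprise: float

    def composite(self) -> float:
        # C = (U + O + S) / 3, the unweighted mean stated in the TL;DR.
        return (self.usefulness + self.originality + self.surprise) / 3


def mean_composite(scores: list[UOSScore]) -> float:
    """Average composite score over several responses.

    Averaging across responses is an assumed aggregation, not the paper's.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(s.composite() for s in scores) / len(scores)


if __name__ == "__main__":
    example = UOSScore(usefulness=0.9, originality=0.6, surprise=0.3)
    print(round(example.composite(), 3))
```

Because the three components are weighted equally, a response that is highly useful but unsurprising can receive the same composite score as one that is surprising but only moderately useful, so the per-component values are worth reporting alongside $C$.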

Abstract

The meteoric rise of foundation models (FMs) has expanded their capabilities far beyond conventional tasks. Creativity, long regarded as a hallmark of human intelligence and a driver of innovation, is now increasingly recognized as a critical dimension of machine intelligence in the era of generative FMs, complementing traditional measures of accuracy. However, existing evaluation frameworks for creativity remain fragmented, relying on ad hoc metrics not firmly grounded in established theories. To address this gap, we introduce C^2-Eval, a holistic benchmark for unified assessment of creativity in FMs. C^2-Eval distinguishes between two complementary forms of creativity: convergent creativity, where tasks admit constrained solutions (e.g., code generation), and divergent creativity, where tasks are open-ended (e.g., storytelling). It evaluates both dimensions using fine-grained criteria derived from social-science theory, focusing on Usefulness, Originality, and Surprise (U-O-S). Through extensive experiments on leading proprietary and open-source models, we analyze trade-offs in their creative capabilities. Our results highlight both the strengths and challenges of current FMs in pursuing a creative machine mind, showing that C^2-Eval is an effective lens for examining the evolving landscape of creative AI.

Paper Structure

This paper contains 30 sections, 1 equation, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of our $\text{C}^2$-Eval framework. Left panel: the two distinct creativity regimes. The convergent regime focuses on structured tasks such as question answering and code generation, and the divergent regime covers open-ended tasks such as storytelling and idea generation. Right panel: the U-O-S components of creativity used to evaluate responses across both regimes.
  • Figure 2: Sample evaluation cases within our $\text{C}^2$-Eval framework. The top panel presents examples of convergent creativity tasks, including Textual QA, Code Generation, and Visual QA, where an Autograder assesses the U-O-S triplet. The bottom panel displays divergent creativity tasks, such as Story Telling, Idea Generation, and Visual Story Telling, where an advanced LLM judges responses based on the same U-O-S components. Details of the Autograder and the LLM-as-a-Judge used here can be found in Section \ref{sec:metrics}.
  • Figure 3: Reasoning vs. non-reasoning models of the same size. Left (a): Qwen3-8B in thinking mode vs. Qwen3-8B in non-thinking mode. Right (b): Qwen2.5-32B vs. QwQ-32B.
  • Figure 4: Radar charts illustrating the multi-dimensional components of creativity. The eight axes in each sub-figure represent four indicators (usefulness, originality, surprise, creativity) under both divergent (Div.) and convergent (Conv.) creativity. All values are normalized to (0,1).
  • Figure 5: Ablation study on the effectiveness of creative instructions. Adding creative instructions generally boosts both the convergent and divergent creativity of FMs.