What Shapes a Creative Machine Mind? Comprehensively Benchmarking Creativity in Foundation Models
Zicong He, Boxuan Zhang, Weihao Liu, Ruixiang Tang, Lu Cheng
TL;DR
The paper tackles the problem of evaluating creativity in foundation models, arguing that accuracy alone fails to capture innovation. It introduces C^2-Eval, a unified benchmark that separately assesses convergent (structured, constrained) and divergent (open-ended) creativity using a shared rubric of Usefulness ($U$), Originality ($O$), and Surprise ($S$), with a composite score $C = (U+O+S)/3$. The authors benchmark over 20 proprietary and open-source models across text and vision-language tasks and find that creativity does not scale monotonically with model size, and that reasoning capabilities and carefully designed creative instructions can substantially boost creative outputs. They also explore the relationship between convergent and divergent creativity, revealing nuanced trade-offs and task-dependent profiles, and provide a reproducible evaluation pipeline to guide future research in creative AI. The work has practical significance for designing, evaluating, and steering the development of creative foundation models in a responsible and interpretable way.
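As a rough illustration of the composite score $C = (U+O+S)/3$ mentioned above, the following minimal Python sketch averages the three rubric scores with equal weight. The class name `UOSScores`, its field names, and the numeric values in the usage note are hypothetical and chosen only for illustration; the summary does not specify the scoring scale or how each criterion is rated, so none of that is modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UOSScores:
    """Hypothetical container for one response's rubric scores.

    The score scale is an assumption; the source only gives the
    composite formula C = (U + O + S) / 3.
    """

    usefulness: float   # U
    originality: float  # O
    surprise: float     # S

    def composite(self) -> float:
        # Equal-weight average of the three criteria.
        return (self.usefulness + self.originality + self.surprise) / 3
```

For example, under an assumed common scale, `UOSScores(3.0, 4.0, 5.0).composite()` gives `4.0`. Because the three criteria are weighted equally, a response that is highly original but not useful can receive the same composite as one that is moderately strong on all three, which is why reporting $U$, $O$, and $S$ alongside $C$ may be informative.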
Abstract
The meteoric rise of foundation models (FMs) has expanded their capabilities far beyond conventional tasks. Creativity, long regarded as a hallmark of human intelligence and a driver of innovation, is now increasingly recognized as a critical dimension of machine intelligence in the era of generative FMs, complementing traditional measures of accuracy. However, existing evaluation frameworks for creativity remain fragmented, relying on ad hoc metrics not firmly grounded in established theories. To address this gap, we introduce C^2-Eval, a holistic benchmark for unified assessment of creativity in FMs. C^2-Eval distinguishes between two complementary forms of creativity: convergent creativity, where tasks admit constrained solutions (e.g., code generation), and divergent creativity, where tasks are open-ended (e.g., storytelling). It evaluates both dimensions using fine-grained criteria derived from social-science theory, focusing on Usefulness, Originality, and Surprise (U-O-S). Through extensive experiments on leading proprietary and open-source models, we analyze trade-offs in their creative capabilities. Our results highlight both the strengths and challenges of current FMs in pursuing a creative machine mind, showing that C^2-Eval is an effective lens for examining the evolving landscape of creative AI.
