Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations
Li-Chun Lu, Miri Liu, Pin-Chun Lu, Yufei Tian, Shao-Hua Sun, Nanyun Peng
TL;DR
The paper critically evaluates four automated creativity metrics (Creativity Index (CI), Perplexity (PPL), Syntactic Templates, and LLM-as-a-Judge) across three domains (creative writing, unconventional problem-solving, and research ideation) using curated datasets and human judgments. It finds limited cross-metric consistency: CI mainly tracks lexical diversity rather than conceptual originality, PPL reflects model confidence rather than novelty, syntactic templates capture structure but miss ideas, and LLM-based judgments are biased and unstable. The authors argue for more robust, generalizable evaluation frameworks that align better with human creativity judgments, and demonstrate that chain-of-thought and rubric-based prompting can partly mitigate the weaknesses of LLM evaluators. Overall, the work provides diagnostic guidance and actionable directions for developing domain-aware, concept-level creativity assessments that go beyond surface-level patterns.
Abstract
We systematically examine, analyze, and compare representative creativity measures (creativity index, perplexity, syntactic templates, and LLM-as-a-Judge) across diverse creative domains, including creative writing, unconventional problem-solving, and research ideation. Our analyses reveal that these metrics exhibit limited consistency, capturing different dimensions of creativity. We highlight key limitations, including the creativity index's focus on lexical diversity, perplexity's sensitivity to model confidence, and syntactic templates' inability to capture conceptual creativity. Additionally, LLM-as-a-Judge shows instability and bias. Our findings underscore the need for more robust, generalizable evaluation frameworks that better align with human judgments of creativity.
