Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations

Li-Chun Lu, Miri Liu, Pin-Chun Lu, Yufei Tian, Shao-Hua Sun, Nanyun Peng

TL;DR

The paper critically evaluates four automated creativity metrics (Creativity Index (CI), Perplexity (PPL), Syntactic Templates, and LLM-as-a-Judge) across three domains (creative writing, unconventional problem-solving, and research ideation) using curated datasets and human judgments. It finds limited cross-metric consistency: CI mainly tracks lexical diversity rather than conceptual originality, PPL is sensitive to model confidence rather than novelty, syntactic templates capture structure but miss ideas, and LLM-based judgments are biased and unstable. The authors call for more robust, generalizable evaluation frameworks that align better with human creativity judgments, and report that chain-of-thought and rubric-based prompting can partly ameliorate LLM-evaluator weaknesses. Overall, the work provides diagnostic guidance and directions for developing domain-aware, concept-level creativity assessments that go beyond surface-level patterns.

Abstract

We systematically examine, analyze, and compare representative creativity measures--creativity index, perplexity, syntactic templates, and LLM-as-a-Judge--across diverse creative domains, including creative writing, unconventional problem-solving, and research ideation. Our analyses reveal that these metrics exhibit limited consistency, capturing different dimensions of creativity. We highlight key limitations, including the creativity index's focus on lexical diversity, perplexity's sensitivity to model confidence, and syntactic templates' inability to capture conceptual creativity. Additionally, LLM-as-a-Judge shows instability and bias. Our findings underscore the need for more robust, generalizable evaluation frameworks that better align with human judgments of creativity.

Paper Structure

This paper contains 27 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Illustrations of creativity metrics selected for evaluation. (a) Creativity index captures phrase-level originality by identifying reused segments from large-scale web datasets. (b) Perplexity measures token-level unexpectedness via language model prediction probabilities. (c) Syntactic templates evaluate structure-level novelty by detecting reliance on common syntactic patterns. (d) LLM-as-a-judge incorporates rubric-based scoring and chain-of-thought reasoning to assess overall creative quality.
  • Figure 2: L-uniqueness vs. minimum n-gram length L. The creativity index is sensitive to L.
  • Figure 3: CI value vs. L-uniqueness range. As L-uniqueness increases, the CI difference decreases.
  • Figure 4: Template rate by n-gram template. With the 100 most common templates selected, their actual occurrence rate varies with n.
  • Figure 5: Example of the 8-gram template. LLM-generated plots reuse certain sentences, whereas human-written plots display more varied word choices.
  • ...and 1 more figures
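
As a rough illustration of the perplexity metric described in the Figure 1 caption (token-level unexpectedness derived from language model prediction probabilities), the sketch below uses the standard definition: the exponential of the average negative log-probability per token. The function name `perplexity` and the hand-written probability lists are assumptions made for this example, not the paper's implementation; in practice the probabilities would come from a language model.

```python
import math


def perplexity(token_probs):
    """Standard perplexity: exp of the mean negative log-probability.

    `token_probs` is a hypothetical input: the probability a language
    model assigned to each token of a text, given the preceding tokens.
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("each probability must be in (0, 1]")
    # Average negative log-likelihood per token.
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# Illustrative values only: a text the model finds predictable
# versus one it finds surprising.
predictable = [0.9, 0.8, 0.95, 0.85]
surprising = [0.1, 0.05, 0.2, 0.15]

print(perplexity(predictable) < perplexity(surprising))
```

Lower token probabilities give higher perplexity, so the score depends on how confident the scoring model is, which is the sensitivity the abstract names as a limitation of this metric.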