Automated Creativity Evaluation for Large Language Models: A Reference-Based Approach

Ruizhe Li, Chiwei Zhu, Benfeng Xu, Xiaorui Wang, Zhendong Mao

TL;DR

This paper tackles the challenge of evaluating LLM creativity by turning the Torrance Test of Creative Writing (TTCW) into a scalable, automated, reference-based assessment. It introduces a Likert-style, binary-criteria framework that scores generated texts against high-quality reference texts, using an analyze-rate prompting strategy to improve reliability. On a New Yorker plot-based dataset, the method aligns more closely with human judgments than the baselines, reaching a pairwise accuracy of $0.75$ ($+15\%$), and it remains robust to dataset and reference variations. The approach offers a practical benchmark for comparing models on creative generation, with guidance on scoring, scoring granularity, and generalization potential.

Abstract

Creative writing is a key capability of Large Language Models (LLMs), with potential applications in literature, storytelling, and various creative domains. However, evaluating the creativity of machine-generated texts remains a significant challenge, as existing methods either rely on costly manual annotations or fail to align closely with human assessments. In this paper, we propose an effective automated evaluation method based on the Torrance Test of Creative Writing (TTCW), which evaluates creativity as a product. Our method employs a reference-based Likert-style approach, scoring generated creative texts relative to high-quality reference texts across various tests. Experimental results demonstrate that our method significantly improves the alignment between LLM evaluations and human assessments, achieving a pairwise accuracy of 0.75 (+15%).
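The headline metric, pairwise accuracy, is not defined in this excerpt. A minimal sketch, assuming the common definition (the fraction of item pairs that the automated evaluator orders the same way as the human annotators), might look like the following. The function name `pairwise_accuracy`, the tie-handling rules, and the example scores are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations


def pairwise_accuracy(human_scores, model_scores):
    """Fraction of item pairs ordered the same way by humans and the evaluator.

    Assumptions of this sketch (not taken from the paper):
    - pairs that humans score as tied are skipped;
    - a tie in the evaluator's scores on a non-tied human pair counts as a miss.
    """
    if len(human_scores) != len(model_scores):
        raise ValueError("score lists must have the same length")

    agree = 0
    total = 0
    for i, j in combinations(range(len(human_scores)), 2):
        human_diff = human_scores[i] - human_scores[j]
        if human_diff == 0:
            continue  # no human preference for this pair
        model_diff = model_scores[i] - model_scores[j]
        total += 1
        if human_diff * model_diff > 0:  # same sign means same ordering
            agree += 1
    return agree / total if total else float("nan")


# Hypothetical scores for four stories, e.g. number of TTCW tests passed.
human = [10, 7, 4, 2]
model = [3, 2, 0, 1]
print(pairwise_accuracy(human, model))
```

Under this definition, a value of 0.5 corresponds to chance-level agreement on pairs and 1.0 to a ranking identical to the human one.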

Paper Structure

This paper contains 30 sections, 5 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Distribution of TTCW test scores across different models. The dashed lines indicate the average number of tests passed by each model.
  • Figure 2: Complete Spearman correlation results across individual stories and models. Models labeled 'ours' indicate performance using our proposed method. The results are sorted in descending order of the average values.
  • Figure 3: Complete Kendall’s tau results across individual stories and models. Models labeled 'ours' indicate performance using our proposed method. The results are sorted in descending order of the average values.
  • Figure 4: Complete pairwise accuracy results across individual stories and models. Models labeled 'ours' indicate performance using our proposed method. The results are sorted in descending order of the average values.
  • Figure 5: Spearman correlation performance under different cutoff scores.
  • ...and 3 more figures