Art or Artifice? Large Language Models and the False Promise of Creativity
Tuhin Chakrabarty, Philippe Laban, Divyansh Agarwal, Smaranda Muresan, Chien-Sheng Wu
TL;DR
The paper tackles the challenge of objectively evaluating creativity in writing by adapting the Torrance Test of Creative Thinking into a product-oriented framework, the Torrance Test of Creative Writing (TTCW), using the Consensual Assessment Technique. It builds a benchmark of 48 short stories (12 human-written and 36 from top LLMs) assessed by 10 creative-writing experts, and finds that LLMs lag behind humans across multiple TTCW dimensions and cannot reliably rate creativity themselves. The authors formalize 14 binary TTCW tests across Fluency, Flexibility, Originality, and Elaboration, implement an artifact-centric evaluation, and report moderate inter-rater reliability (Fleiss’ κ ≈ 0.41) with strong aggregate agreement (ρ ≈ 0.69). A parallel experiment tests LLMs as TTCW assessors and finds near-zero correlations with expert judgments, suggesting that LLMs are not yet capable evaluators, though they hold potential as interactive co-writing tools. The work provides a large, open TTCW annotation dataset and discusses the framework's broader applicability, its limitations, and directions for future research and writing-support tools.
Abstract
Researchers have argued that large language models (LLMs) exhibit high-quality writing capabilities from blogs to stories. However, objectively evaluating the creativity of a piece of writing is challenging. Inspired by the Torrance Test of Creative Thinking (TTCT), which measures creativity as a process, we use the Consensual Assessment Technique [3] and propose the Torrance Test of Creative Writing (TTCW) to evaluate creativity as a product. TTCW consists of 14 binary tests organized into the original dimensions of Fluency, Flexibility, Originality, and Elaboration. We recruit 10 creative writers and implement a human assessment of 48 stories written either by professional authors or LLMs using TTCW. Our analysis shows that LLM-generated stories pass three to ten times fewer TTCW tests than stories written by professionals. In addition, we explore the use of LLMs as assessors to automate the TTCW evaluation, revealing that none of the LLMs positively correlate with the expert assessments.
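Because every TTCW test is a binary pass/fail judgment, agreement among expert assessors can be summarized with Fleiss' kappa, the statistic reported in the TL;DR. The following is a minimal sketch of the standard Fleiss' kappa formula applied to pass/fail votes, not the authors' code. The function names, the vote layout, and the toy data (three raters per item) are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Sequence


def votes_to_counts(votes: Sequence[Sequence[int]]) -> List[List[int]]:
    """Turn per-item binary votes (1 = pass, 0 = fail) into [fail, pass] counts.

    Hypothetical helper: votes[i] holds one 0/1 vote per rater for item i,
    where an item could be one (story, test) pair.
    """
    return [[len(v) - sum(v), sum(v)] for v in votes]


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Standard Fleiss' kappa; counts[i][j] = raters putting item i in category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")

    # Observed agreement: mean over items of the fraction of agreeing rater pairs.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")

    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy data only (an assumption for illustration): 4 items, 3 raters each.
example_votes = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 1, 0]]
example_kappa = fleiss_kappa(votes_to_counts(example_votes))
print(f"Fleiss' kappa on toy data: {example_kappa:.3f}")
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so the reported κ ≈ 0.41 falls in the range conventionally described as moderate.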
