Art or Artifice? Large Language Models and the False Promise of Creativity

Tuhin Chakrabarty; Philippe Laban; Divyansh Agarwal; Smaranda Muresan; Chien-Sheng Wu

TL;DR

The paper tackles the challenge of objectively evaluating creativity in writing by adapting the Torrance Test of Creative Thinking into a product-oriented framework (TTCW) using the Consensual Assessment Technique. It builds a benchmark of 48 short stories (12 human-written and 36 from top LLMs) assessed by 10 creative-writing experts, revealing that LLMs lag behind humans across multiple TTCW dimensions and that LLMs cannot reliably rate creativity either. The authors formalize 14 binary TTCW tests across Fluency, Flexibility, Originality, and Elaboration, implement an artifact-centric evaluation, and demonstrate moderate inter-rater reliability (Fleiss’ κ ≈ 0.41) with strong aggregate agreement (ρ ≈ 0.69). A parallel exploration tests LLMs as TTCW assessors, finding near-zero correlations with expert judgments and suggesting LLMs are not yet capable evaluators, though they hold potential as interactive co-writing tools. The work provides a large, open TTCW annotation dataset and discusses universal applicability, limitations, and directions for future research and writing-support tools.
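
The inter-rater figure quoted above is a Fleiss' kappa over categorical judgments. As a rough illustration of how such a statistic is computed for binary Yes/No tests, the following is a minimal sketch, not the authors' analysis code. The function name `fleiss_kappa`, the toy counts, and the choice of three raters per item are all assumptions made for the example.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j.

    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Observed agreement: for each item, the share of rater pairs that agree.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement: based on the overall share of ratings per category.
    p_category = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in p_category)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when one category gets all ratings")

    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy data (not from the paper): 4 (story, test) items,
# 3 raters each, columns are [Yes, No].
toy_counts = [[3, 0], [2, 1], [0, 3], [1, 2]]
kappa = fleiss_kappa(toy_counts)
print(round(kappa, 3))
```

Kappa is 1 when every rater agrees on every item and about 0 when agreement is no better than chance, which is why a value near 0.41 is usually described as moderate.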

Abstract

Researchers have argued that large language models (LLMs) exhibit high-quality writing capabilities from blogs to stories. However, objectively evaluating the creativity of a piece of writing is challenging. Inspired by the Torrance Test of Creative Thinking (TTCT), which measures creativity as a process, we use the Consensual Assessment Technique [3] and propose the Torrance Test of Creative Writing (TTCW) to evaluate creativity as a product. TTCW consists of 14 binary tests organized into the original dimensions of Fluency, Flexibility, Originality, and Elaboration. We recruit 10 creative writers and implement a human assessment of 48 stories written either by professional authors or LLMs using TTCW. Our analysis shows that LLM-generated stories pass 3-10X fewer TTCW tests than stories written by professionals. In addition, we explore the use of LLMs as assessors to automate the TTCW evaluation, revealing that none of the LLMs positively correlate with the expert assessments.
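
Because the tests are binary and grouped by dimension, a story's result can be summarized as a count of tests passed. The sketch below shows one way to do that tally, under stated assumptions: the helper names `dimension_of` and `summarize` are hypothetical, the test identifiers follow a "Fluency1"-style naming by analogy, and the example outcomes are invented purely for illustration.

```python
from collections import defaultdict
from typing import Dict, Mapping


def dimension_of(test_id: str) -> str:
    """Map an identifier such as 'Fluency1' to its dimension, 'Fluency'.

    Assumes identifiers are a dimension name followed by a number.
    """
    return test_id.rstrip("0123456789")


def summarize(results: Mapping[str, bool]) -> Dict[str, int]:
    """Count passed tests per dimension and overall for one story."""
    per_dimension: Dict[str, int] = defaultdict(int)
    for test_id, passed in results.items():
        # Adding 0 for a failed test still registers the dimension.
        per_dimension[dimension_of(test_id)] += int(passed)
    summary = dict(per_dimension)
    summary["total"] = sum(results.values())
    return summary


# Hypothetical outcomes for a single story on a subset of tests.
story_results = {
    "Fluency1": True,
    "Fluency2": False,
    "Flexibility1": True,
    "Originality1": False,
    "Elaboration1": True,
}
summary = summarize(story_results)
print(summary)
```

Keeping only the total discards which tests failed, so the per-dimension counts are retained alongside it here.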

Paper Structure (64 sections, 6 figures, 27 tables)

Introduction
Related Work
Creativity Evaluation
Evaluating Creative Writing
Expert Evaluation of Language Model Generations
Design Considerations
Design Principle 1: Leveraging the Torrance Test Metrics.
Design Principle 2: Artifact-centric Testing.
Design Principle 3: Binary (Yes-No) Questions with Open-Ended Rationales.
Design Principle 4: Additive Nature of Tests.
Formative Study: Formulating the Torrance Tests For Creative Writing
From Measures to Actionable Tests
The Torrance Test for Creative Writing
Fluency
Narrative Pacing (TTCW Fluency1): Does the manipulation of time in terms of compression or stretching feel appropriate and balanced?
...and 49 more sections

Figures (6)

Figure 1: Pipeline showing the construction of TTCW and the evaluation of short stories using the TTCW framework. Step 1) shows how experts leverage the process-oriented Torrance Test of Creative Thinking to create 14 tests for evaluating creativity in short stories as a product. Step 2) demonstrates artifact-centric testing, where 4 stories based on a single plot are used as the products for creativity evaluation. Step 3) shows the evaluation of stories using the TTCW framework by both human experts and LLMs, where each provides Yes/No answers to individual tests followed by natural-language rationales justifying the decision.
Figure 2: Pipeline showing how our test set is created for evaluation. For each human-written original New Yorker story, we generate 3 stories from one LLM each, based on the plot of the original story. The plot is a single-sentence summary of the original story automatically generated by GPT-4 and verified by humans.
Figure 3: Distribution of aggregate TTCW results, in which only the number of tests passed is retained.
Figure 4: Relative evaluation. The left figure shows the ranking preference assigned to each story within a group. The right figure shows how creative experts attributed any given story from The New Yorker or the 3 LLMs to one of three options: an experienced writer, an amateur writer, or an AI.
Figure 5: Distribution of word count of stories in our test set.
...and 1 more figures
