Evaluating Creative Short Story Generation in Humans and Large Language Models
Mete Ismayilzada, Claire Stevenson, Lonneke van der Plas
TL;DR
This study systematically compares creativity in short-story generation between 60 humans and 60 large language models using a five-sentence cue-word task and multi-dimensional creativity metrics (diversity, novelty, surprise, complexity). It combines automated metrics with judgments from expert and non-expert humans as well as three LLM judges, revealing that while LLMs produce linguistically complex stories, they lag humans in novelty, surprise, and diversity; expert judgments align with automated metrics, whereas non-experts and LLM judges tend to rate AI-generated stories as more creative. The work also shows that experts better differentiate human vs. AI authorship, and that "complexity" factors can inflate perceived creativity among non-experts and LLM judges. These findings have implications for evaluating human and artificial creativity, informing prompt design and model steering to better align AI creativity with human-valued novelty and surprise.
Abstract
Story-writing is a fundamental aspect of human imagination, relying heavily on creativity to produce narratives that are novel, effective, and surprising. While large language models (LLMs) have demonstrated the ability to generate high-quality stories, their creative story-writing capabilities remain under-explored. In this work, we conduct a systematic analysis of creativity in short story generation across 60 LLMs and 60 people using a five-sentence, cue-word-based creative story-writing task. We use automated measures to evaluate model- and human-generated stories across several dimensions of creativity, including novelty, surprise, diversity, and linguistic complexity. We also collect creativity ratings and Turing Test classifications from non-expert and expert human raters and from LLMs. Automated metrics show that LLMs generate stylistically complex stories but tend to fall short in novelty, surprise, and diversity when compared to average human writers. Expert ratings generally coincide with automated metrics. However, LLMs and non-experts rate LLM stories as more creative than human-generated stories. We discuss why and how these differences in ratings occur, and their implications for both human and artificial creativity.
