What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation

Dingyi Yang; Qin Jin

TL;DR

This survey addresses the multifaceted problem of story evaluation by reviewing story generation tasks across text-only and multimodal settings, defining human-centered evaluation criteria, and collating benchmark datasets. It introduces a taxonomy that separates traditional and LLM-based metrics, detailing their methods and outputs, and analyzes how well they align with human judgments on standard benchmarks. The paper highlights the rising role of large language models in evaluation, including opportunities and biases, and discusses human-AI collaboration as a path toward more reliable assessment. Finally, it offers concrete recommendations for standardized criteria, extended benchmarks, long-form and personalized evaluation, fairness and robustness, and collaborative frameworks, aiming to drive more consistent and interpretable evaluation in story generation research.

Abstract

With the development of artificial intelligence, particularly the success of Large Language Models (LLMs), the quantity and quality of automatically generated stories have significantly increased. This has led to the need for automatic story evaluation to assess the generative capabilities of computing systems and analyze the quality of both automatically generated and human-written stories. Evaluating a story can be more challenging than other generation evaluation tasks. While tasks like machine translation primarily focus on assessing fluency and accuracy, story evaluation demands complex additional measures such as overall coherence, character development, and interestingness. This requires a thorough review of relevant research. In this survey, we first summarize existing storytelling tasks, including text-to-text, visual-to-text, and text-to-visual. We highlight their evaluation challenges, identify various human criteria to measure stories, and present existing benchmark datasets. Then, we propose a taxonomy to organize evaluation metrics that have been developed or can be adopted for story evaluation. We also provide descriptions of these metrics, along with a discussion of their merits and limitations. Later, we discuss human-AI collaboration for story evaluation and generation. Finally, we suggest potential future research directions, extending from story evaluation to general evaluations.

Paper Structure (53 sections, 5 equations, 5 figures, 6 tables)

Introduction
Story Generation Tasks
Text-to-Text
Visual-to-Text
Text-to-Visual
Story Evaluation Criteria and Benchmark Datasets
Story Evaluation Criteria
Story Evaluation Benchmark Datasets
Taxonomy of Evaluation Metrics
Evaluation Methods
Evaluation Output Format
Traditional Evaluation
Lexical-based Metrics
Embedding-based Metrics
Probability-based Metrics
...and 38 more sections

Figures (5)

Figure 1: General Framework of Story Evaluation, which shows the evaluation inputs and output formats (Section \ref{sec:format}). All the dashed boxes are optional input or output.
Figure 2: Taxonomy of evaluation metrics that have been proposed for, or can be adopted for, story evaluation. The metrics that are specifically proposed for story evaluation are colored.
Figure 3: Illustration of different types of neural models applied for automatic evaluation metrics (all the dashed boxes are optional input or output): (a) Embedding-Based Methods, which evaluate based on separately encoded vectors (left) or a jointly encoded vector (right); (b) Probability-Based Methods, which calculate based on the generation probability of the target story; (c) Generative-Based Methods, which directly generate the evaluation results, with or without the reasoning process. These three types of models can be fine-tuned on evaluation benchmarks, referred to as Trained Metrics.
Figure 4: The Pearson Correlation between various metrics and human ratings on the OpenMEVA (ROC) benchmark dataset.
Figure 5: The Kendall Correlation between powerful metrics and multi-aspect human ratings proposed by xie2023next.
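
Figures 4 and 5 report how closely automatic metric scores track human ratings, using Pearson and Kendall correlation respectively. The following is a minimal sketch of that kind of comparison, not the survey's own evaluation code. The score lists are made-up illustrative values, the function names are hypothetical, and the Kendall variant shown is the simple tau-a, which does not correct for ties.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def kendall_tau_a(x, y):
    """Kendall tau-a: rank agreement over all pairs (ties count as neither)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical data: one human rating and one metric score per story.
human_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_scores = [0.2, 0.1, 0.5, 0.4, 0.9]

print("Pearson:", round(pearson(human_ratings, metric_scores), 3))
print("Kendall tau-a:", round(kendall_tau_a(human_ratings, metric_scores), 3))
```

Pearson rewards a metric whose scores scale linearly with human ratings, while Kendall only checks whether pairs of stories are ordered the same way, so the two can disagree when a metric ranks stories well but on a nonlinear scale.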
