Do Language Models Enjoy Their Own Stories? Prompting Large Language Models for Automatic Story Evaluation

Cyril Chhun, Fabian M. Suchanek, Chloé Clavel

TL;DR

The paper investigates whether large language models (LLMs) can substitute for human annotators in automatic story evaluation (ASE) and also analyzes their performance at automatic story generation (ASG). Using the HANNA criteria and four Eval-Prompts, the study measures overall and system-level correlations between LLM ratings and human judgments, the effect of prompting, and the quality of LLM explanations; it finds high system-level alignment but limited explainability. For ASG, larger open-source models achieve scores comparable to or exceeding those of human stories, and pretraining data influences the results. The findings suggest that LLMs are useful for scalable system-level ASE and for ranking story-generation models, with caveats around single-story judgments, explanation quality, and data contamination, and with practical implications for research transparency and reproducibility.

Abstract

Storytelling is an integral part of human experience and plays a crucial role in social interactions. Thus, Automatic Story Evaluation (ASE) and Generation (ASG) could benefit society in multiple ways, but they are challenging tasks which require high-level human abilities such as creativity, reasoning and deep understanding. Meanwhile, Large Language Models (LLM) now achieve state-of-the-art performance on many NLP tasks. In this paper, we study whether LLMs can be used as substitutes for human annotators for ASE. We perform an extensive analysis of the correlations between LLM ratings, other automatic measures, and human annotations, and we explore the influence of prompting on the results and the explainability of LLM behaviour. Most notably, we find that LLMs outperform current automatic measures for system-level evaluation but still struggle at providing satisfactory explanations for their answers.

Paper Structure (50 sections, 4 equations, 9 figures, 9 tables)

Figures (9)

  • Figure 1: Example Eval-Prompt and answer from our experiments. "Prompt" inside the Eval-Prompt refers to the story-prompt.
  • Figure 2: Schema of the performed ASE experiments. RE, CH, etc. are the considered human criteria (see the methodology subsection). "EP" means "Eval-Prompt", also defined in the methodology subsection. For the user study (see the user study subsection), we randomly sampled 100 explanations from our experiments.
  • Figure 3: Example Eval-Prompts for the Surprise criterion. Eval-Prompt 2 is the same as Eval-Prompt 1 with "explain your answer" added at the end. "Prompt" (bold) refers to the story-prompt.
  • Figure 4: Overall absolute Kendall correlations between evaluation measures and human ratings. Higher is better. The black vertical line separates LLMs (left) and non-LLMs (right). Coefficient values are multiplied by 100 for readability; we will symbolize this with "($\times$100)" in the next figures.
  • Figure 5: System-level absolute Kendall correlations ($\times$100) between evaluation measures and human ratings. Higher is better. The white vertical line separates LLMs (left) and non-LLMs (right).
  • ...and 4 more figures
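
The captions of Figures 4 and 5 distinguish overall from system-level Kendall correlations. The sketch below illustrates that distinction under stated assumptions: it uses Kendall's tau-b, treats "system-level" as correlating per-system mean scores (a common convention, not confirmed here as the paper's exact procedure), and runs on toy records invented for illustration. The function names and the `records` layout are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences (O(n^2), fine for small data)."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: excluded from every count
        elif dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else 0.0


def overall_correlation(records):
    """Absolute correlation computed over all individual stories."""
    human = [r["human"] for r in records]
    metric = [r["metric"] for r in records]
    return abs(kendall_tau_b(human, metric))


def system_level_correlation(records):
    """Absolute correlation between per-system mean scores (assumed definition)."""
    by_system = defaultdict(list)
    for r in records:
        by_system[r["system"]].append(r)
    human_means, metric_means = [], []
    for rows in by_system.values():
        human_means.append(sum(r["human"] for r in rows) / len(rows))
        metric_means.append(sum(r["metric"] for r in rows) / len(rows))
    return abs(kendall_tau_b(human_means, metric_means))


# Toy data (invented): the metric swaps the two stories inside each system,
# but still orders the three systems exactly as the human ratings do.
records = [
    {"system": "A", "human": 1, "metric": 2},
    {"system": "A", "human": 2, "metric": 1},
    {"system": "B", "human": 3, "metric": 4},
    {"system": "B", "human": 4, "metric": 3},
    {"system": "C", "human": 5, "metric": 6},
    {"system": "C", "human": 6, "metric": 5},
]

print("overall:", overall_correlation(records))
print("system-level:", system_level_correlation(records))
```

On this toy data the system-level correlation is 1.0 while the overall correlation is about 0.6, since 3 of the 15 story pairs are discordant. This shows how a measure can rank systems reliably while being less dependable on individual stories.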