Are Large Language Models Capable of Generating Human-Level Narratives?

Yufei Tian, Tenghao Huang, Miri Liu, Derek Jiang, Alexander Spangher, Muhao Chen, Jonathan May, Nanyun Peng

TL;DR

This paper investigates the capability of LLMs in storytelling, focusing on narrative development and plot progression, and introduces a novel computational framework to analyze narratives through three discourse-level aspects: i) story arcs, ii) turning points, and iii) affective dimensions, including arousal and valence.

Abstract

This paper investigates the capability of LLMs in storytelling, focusing on narrative development and plot progression. We introduce a novel computational framework to analyze narratives through three discourse-level aspects: i) story arcs, ii) turning points, and iii) affective dimensions, including arousal and valence. By leveraging expert and automatic annotations, we uncover significant discrepancies between LLM- and human-written stories. While human-written stories are suspenseful, arousing, and diverse in narrative structures, LLM stories are homogeneously positive and lack tension. Next, we measure narrative reasoning skills as a precursor to generative capacities, concluding that most LLMs fall short of human abilities in discourse understanding. Finally, we show that explicit integration of the aforementioned discourse features can enhance storytelling, as demonstrated by an improvement of over 40% in neural storytelling in terms of diversity, suspense, and arousal.

Paper Structure (36 sections, 20 figures, 12 tables)

Figures (20)

  • Figure 1: The story arc and turning point positions of human- and LLM-generated narratives. The vertical axis shows the character's fortune (bad to good), and the horizontal axis represents the timeline (beginning to end). Compared with human storytellers, LLMs tend to (1) adopt homogeneously happier, less complex story arcs, (2) introduce plot turning points earlier in the timeline, and (3) have less suspense and fewer setbacks in their storylines. The impact of these differences grows as LLMs gain greater prominence in communicative patterns.
  • Figure 2: Violin plots showing the positions of five turning points: TP1 - opportunity, TP2 - change of plans, TP3 - point of no return, TP4 - major setback, and TP5 - climax. Relative positions (y-axis) are calculated as $\frac{\text{Index}(\text{TP}_i)}{\text{Total Length}}$. For example, 0.5 means that the turning point occurs exactly in the middle of the story. TP4 and TP5 arrive early in AI outputs, indicating poor pacing and a lack of intensity.
  • Figure 3: Arousal of human-written and GPT-4-generated stories. Human stories consistently exhibit higher levels of suspense (greater arousal). The gap widens from the midpoint to the end.
  • Figure 4: The shares of story arcs differ significantly between human-written and GPT-4-generated stories. GPT-4 is much more likely than human writers to generate story arcs with fewer inflections and happier endings.
  • Figure 5: Valence of human-written and GPT-4-generated stories. Human-written stories contain more setbacks (lower valence) than GPT-4 stories. The gap widens from the midpoint to the end.
  • ...and 15 more figures
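
The relative turning-point position defined in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `relative_positions`, the dictionary layout, and the choice of unit for the index and the total length (for example, sentences) are assumptions made here. The caption only specifies the ratio Index(TP_i) / Total Length.

```python
from typing import Dict


def relative_positions(tp_indices: Dict[str, int], total_length: int) -> Dict[str, float]:
    """Map each turning point to Index(TP_i) / Total Length.

    tp_indices   -- hypothetical mapping such as {"TP1": 2, ..., "TP5": 18},
                    where each value is the position of the turning point
                    in the story (assumed unit: sentences).
    total_length -- length of the whole story in the same unit.
    """
    if total_length <= 0:
        raise ValueError("total_length must be positive")

    positions: Dict[str, float] = {}
    for name, index in tp_indices.items():
        # An index outside the story cannot yield a meaningful ratio.
        if not 0 <= index <= total_length:
            raise ValueError(f"{name} index {index} is outside the story")
        positions[name] = index / total_length
    return positions


if __name__ == "__main__":
    # Made-up indices for a 20-unit story, for illustration only.
    example = {"TP1": 2, "TP2": 5, "TP3": 10, "TP4": 15, "TP5": 18}
    print(relative_positions(example, 20))
```

Under these assumptions, a turning point at index 10 in a story of length 20 maps to 0.5, matching the caption's "exactly in the middle" example. Lower values for TP4 and TP5 would correspond to the early arrival the caption describes.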