NarraBench: A Comprehensive Framework for Narrative Benchmarking

Sil Hamilton, Matthew Wilkens, Andrew Piper

TL;DR

NarraBench introduces a theory-informed framework for benchmarking narrative understanding, together with a survey of 78 existing benchmarks that diagnoses gaps in current evaluation practice. It estimates that existing benchmarks capture only about 27% of narrative tasks well, with events, style, perspective, and revelation nearly absent, subjective and perspectival judgments underserved, and open data and multimodal tests scarce. The authors present a four-dimensional NarraBench taxonomy (Story, Narration, Discourse, Situatedness) with 12 features, 50 aspects, and three evaluation criteria (Scale, Mode, Variance), plus a testing harness to unify future benchmarks. They also discuss ethical considerations and provide a roadmap for extending the framework to non-deterministic, multilingual, and multimodal narrative understanding in NLP.

Abstract

We present NarraBench, a theory-informed taxonomy of narrative-understanding tasks, as well as an associated survey of 78 existing benchmarks in the area. We find significant need for new evaluations covering aspects of narrative understanding that are either overlooked in current work or are poorly aligned with existing metrics. Specifically, we estimate that only 27% of narrative tasks are well captured by existing benchmarks, and we note that some areas -- including narrative events, style, perspective, and revelation -- are nearly absent from current evaluations. We also note the need for increased development of benchmarks capable of assessing constitutively subjective and perspectival aspects of narrative, that is, aspects for which there is generally no single correct answer. Our taxonomy, survey, and methodology are of value to NLP researchers seeking to test LLM narrative understanding.

Paper Structure

This paper contains 49 sections, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: The twelve primary narrative features of the NarraBench taxonomy coloured according to the Big-4 narrative dimensions (story, narration, discourse, and situatedness) and shaded by how well existing benchmarks match as determined by our survey.
  • Figure 2: NarraBench's primary theoretical foundations from Genette (1980) and Herman (2009).
  • Figure 3: Aligned open benchmarks for narrative understanding tasks identified in this survey.