Narrative Trails: A Method for Coherent Storyline Extraction via Maximum Capacity Path Optimization

Fausto German, Brian Keith, Chris North

TL;DR

Narrative Trails reframes narrative extraction as a maximum capacity path problem on a coherence graph built from latent-space document representations, enabling efficient extraction of up to $k$ distinct storylines between user-specified endpoints. The method combines a projection-space representation (via embeddings, UMAP, and HDBSCAN), a coherence graph with base and sparse coherence measures, and a maximin path-optimization procedure with redundancy reduction. In experiments on Wikispeedia and several other datasets, Narrative Trails achieves higher coherence and reliability than the baselines and runs faster than Narrative Maps; the results also indicate that the method generalizes across contexts and could be extended to multimodal data. The approach offers a scalable, abstractive alternative for sensemaking and information synthesis in large, diverse corpora.

Abstract

Traditional information retrieval is primarily concerned with finding relevant information from large datasets without imposing a structure within the retrieved pieces of data. However, structuring information in the form of narratives--ordered sets of documents that form coherent storylines--allows us to identify, interpret, and share insights about the connections and relationships between the ideas presented in the data. Despite their significance, current approaches for algorithmically extracting storylines from data are scarce, with existing methods primarily relying on intricate word-based heuristics and auxiliary document structures. Moreover, many of these methods are difficult to scale to large datasets and general contexts, as they are designed to extract storylines for narrow tasks. In this paper, we propose Narrative Trails, an efficient, general-purpose method for extracting coherent storylines in large text corpora. Specifically, our method uses the semantic-level information embedded in the latent space of deep learning models to build a sparse coherence graph and extract narratives that maximize the minimum coherence of the storylines. By quantitatively evaluating our proposed methods on two distinct narrative extraction tasks, we show the generalizability and scalability of Narrative Trails in multiple contexts while also simplifying the extraction pipeline.
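
To make the "maximize the minimum coherence" objective concrete, the following is a minimal sketch of a maximum capacity (widest) path search over a small coherence graph. It uses a modified Dijkstra search, which is one standard way to solve this problem and is not necessarily the procedure used by the authors. The function name max_capacity_path, the dictionary-based graph format, and the toy coherence values are all assumptions made for illustration.

```python
import heapq


def max_capacity_path(graph, source, target):
    """Find the path whose weakest edge is as strong as possible.

    graph: dict mapping node -> {neighbor: coherence} (assumed format,
           with coherence values in [0, 1]).
    Returns (bottleneck, path); (0.0, []) if target is unreachable.
    """
    best = {source: float("inf")}   # best known bottleneck per node
    parent = {source: None}
    heap = [(-best[source], source)]  # max-heap via negated capacity
    done = set()

    while heap:
        neg_cap, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for nbr, weight in graph.get(node, {}).items():
            # Capacity of a path is its minimum edge weight.
            cap = min(-neg_cap, weight)
            if cap > best.get(nbr, float("-inf")):
                best[nbr] = cap
                parent[nbr] = node
                heapq.heappush(heap, (-cap, nbr))

    if target not in best:
        return 0.0, []

    # Walk parent pointers back from the target to rebuild the path.
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()
    return best[target], path


# Toy coherence graph (hypothetical values, undirected).
toy = {
    "a": {"b": 0.9, "c": 0.4},
    "b": {"a": 0.9, "c": 0.8, "d": 0.3},
    "c": {"a": 0.4, "b": 0.8, "d": 0.7},
    "d": {"b": 0.3, "c": 0.7},
}

bottleneck, trail = max_capacity_path(toy, "a", "d")
print(bottleneck, trail)
```

In this toy graph the shortest routes from "a" to "d" pass through a weak link (0.3 or 0.4), while the longer route a, b, c, d has a weakest link of 0.7, so the search prefers it. This mirrors the idea that a storyline is only as coherent as its weakest transition.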

Paper Structure

This paper contains 23 sections, 2 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Narrative Trails extraction pipeline. Given two user-selected documents, the algorithm finds storylines that connect them with maximum capacity for coherence.
  • Figure 2: Comparison of extraction execution time per storyline between Narrative Trails and Narrative Maps. The diagonal lines indicate the error bands for execution time across different datasets for each algorithm.
  • Figure 3: Storylines about the COVID-19 pandemic's impact on global flights in January 2020, extracted from a collection of news articles using Narrative Trails (blue) and Narrative Maps (orange). The dashed gray lines represent the DTW matching between the storylines, and the dashed colored lines are the weakest links.

Theorems & Definitions (1)

  • Definition 1: Narrative Trail