Hierarchical Knowledge Graphs for Story Understanding in Visual Narratives

Yi-Chun Chen

TL;DR

The paper presents a hierarchical knowledge-graph framework for visual narrative understanding, organizing content across panel, event, and macro-event levels to enable interpretable, cross-modal reasoning. It integrates panel-level multimodal graphs with sequence-level temporal graphs and event-level semantic graphs, linking them via cross-level relations and a structured annotation scheme applied to a manually annotated Manga109 subset. Four symbolic reasoning tasks (action retrieval, dialogue tracing, character appearance mapping, panel timeline reconstruction) demonstrate the framework’s interpretability and cross-level inference, with near-perfect quantitative results and rich qualitative demonstrations. The work offers a scalable, explainable foundation for narrative analysis, authoring tools, and interactive media applications, and points to future extensions in longer narratives, semantic normalization, and model-assisted annotation pipelines.

Abstract

We present a hierarchical knowledge graph framework for the structured semantic understanding of visual narratives, using comics as a representative domain for multimodal storytelling. The framework organizes narrative content across three levels (panel, event, and macro-event) by integrating symbolic graphs that encode semantic, spatial, and temporal relationships. At the panel level, it models visual elements such as characters, objects, and actions alongside textual components including dialogue and narration. These are systematically connected to higher-level graphs that capture narrative sequences and abstract story structures. Applied to a manually annotated subset of the Manga109 dataset, the framework supports interpretable symbolic reasoning across four representative tasks: action retrieval, dialogue tracing, character appearance mapping, and timeline reconstruction. Rather than prioritizing predictive performance, the system emphasizes transparency in narrative modeling and enables structured inference aligned with cognitive theories of event segmentation and visual storytelling. This work contributes to explainable narrative analysis and offers a foundation for authoring tools, narrative comprehension systems, and interactive media applications.

Paper Structure

This paper contains 35 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: The composition of visual narratives is inherently hierarchical. Story event networks consist of nested events and sub-events, where each event can be broken down into panel sequences. Each panel conveys specific details through multiple modalities, including visual content, text, and spatial composition.
  • Figure 2: Hierarchical narrative event design. Panels are grouped into event segments, which instantiate mid-level events and aggregate into macro-events.
  • Figure 3: Example of panel-level representation. (a) shows the original panel, and (b) shows its multimodal knowledge graph encoding visual and textual entities with semantic links.
  • Figure 4: Sequence-level illustration linking visual panels to their temporal knowledge graph. (a) shows the source panel sequence, while (b) depicts narrative relations such as reading order and event segmentation.
  • Figure 5: Reduced event-level semantic knowledge graph centered on the macro-event "Think of family". The visualization shows a simplified subset for readability, with nodes across four layers (macro-event, events, event segments, and panels) connected by structured edges (subevent_of, precedes_reading, story_time, instantiates). This layout highlights the hierarchical organization of narrative structure and supports interpretable reasoning across levels.
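
The four-layer structure described in the Figure 5 caption can be illustrated with a small typed-edge graph. The following is a minimal sketch under stated assumptions, not the authors' implementation. The relation names `subevent_of`, `instantiates`, and `precedes_reading` and the macro-event name come from the caption. The class `NarrativeGraph`, the node ids `event_1`, `segment_1`, and `panel_*`, the edge directions, and the `in_segment` relation (panel to event segment) are hypothetical. The `story_time` relation is omitted for brevity.

```python
from collections import defaultdict


class NarrativeGraph:
    """Tiny typed-edge graph: nodes carry a layer label, edges a relation name."""

    # Relations that point upward in the hierarchy (child -> parent).
    # "in_segment" is a hypothetical name; the other two come from the caption.
    UPWARD = ("subevent_of", "instantiates", "in_segment")

    def __init__(self):
        self.layer = {}                # node id -> layer name
        self.out = defaultdict(list)   # (source, relation) -> [targets]
        self.inc = defaultdict(list)   # (target, relation) -> [sources]

    def add_node(self, node, layer):
        self.layer[node] = layer

    def add_edge(self, src, relation, dst):
        self.out[(src, relation)].append(dst)
        self.inc[(dst, relation)].append(src)

    def panels_under(self, node):
        """Collect every panel reachable downward from `node`."""
        if self.layer[node] == "panel":
            return [node]
        panels = []
        for relation in self.UPWARD:
            for child in self.inc[(node, relation)]:
                panels.extend(self.panels_under(child))
        return panels

    def reading_order(self, panels):
        """Order panels by following precedes_reading edges within the set."""
        members = set(panels)
        # A start panel has no precedes_reading predecessor inside the set.
        starts = [p for p in panels
                  if not any(q in members
                             for q in self.inc[(p, "precedes_reading")])]
        ordered = []
        current = starts[0] if starts else None
        while current is not None and current not in ordered:
            ordered.append(current)
            successors = [q for q in self.out[(current, "precedes_reading")]
                          if q in members]
            current = successors[0] if successors else None
        return ordered


g = NarrativeGraph()
g.add_node("Think of family", "macro_event")
g.add_node("event_1", "event")
g.add_node("segment_1", "event_segment")
# Panels are deliberately inserted out of reading order.
for panel in ("panel_3", "panel_1", "panel_2"):
    g.add_node(panel, "panel")
    g.add_edge(panel, "in_segment", "segment_1")
g.add_edge("event_1", "subevent_of", "Think of family")
g.add_edge("segment_1", "instantiates", "event_1")
g.add_edge("panel_1", "precedes_reading", "panel_2")
g.add_edge("panel_2", "precedes_reading", "panel_3")

# Cross-level query: descend from the macro-event, then sort by reading order.
timeline = g.reading_order(g.panels_under("Think of family"))
print(timeline)
```

The final query goes from a macro-event down to its panels and then orders them by reading sequence, which is the kind of cross-level traversal that the panel timeline reconstruction task relies on. Storing each edge in both directions keeps upward and downward lookups equally simple.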