Hierarchical Knowledge Graphs for Story Understanding in Visual Narratives
Yi-Chun Chen
TL;DR
The paper presents a hierarchical knowledge-graph framework for visual narrative understanding, organizing content across panel, event, and macro-event levels to enable interpretable, cross-modal reasoning. It integrates panel-level multimodal graphs with sequence-level temporal graphs and event-level semantic graphs, linking them via cross-level relations and a structured annotation scheme applied to a manually annotated Manga109 subset. Four symbolic reasoning tasks (action retrieval, dialogue tracing, character appearance mapping, panel timeline reconstruction) demonstrate the framework’s interpretability and cross-level inference, with near-perfect quantitative results and rich qualitative demonstrations. The work offers a scalable, explainable foundation for narrative analysis, authoring tools, and interactive media applications, and points to future extensions in longer narratives, semantic normalization, and model-assisted annotation pipelines.
Abstract
We present a hierarchical knowledge graph framework for the structured semantic understanding of visual narratives, using comics as a representative domain for multimodal storytelling. The framework organizes narrative content across three levels (panel, event, and macro-event) by integrating symbolic graphs that encode semantic, spatial, and temporal relationships. At the panel level, it models visual elements such as characters, objects, and actions alongside textual components including dialogue and narration. These panel-level graphs are systematically connected to higher-level graphs that capture narrative sequences and abstract story structures. Applied to a manually annotated subset of the Manga109 dataset, the framework supports interpretable symbolic reasoning across four representative tasks: action retrieval, dialogue tracing, character appearance mapping, and timeline reconstruction. Rather than prioritizing predictive performance, the system emphasizes transparency in narrative modeling and enables structured inference aligned with cognitive theories of event segmentation and visual storytelling. This work contributes to explainable narrative analysis and offers a foundation for authoring tools, narrative comprehension systems, and interactive media applications.
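To make the three-level organization concrete, the following is a minimal sketch of how panel, event, and macro-event nodes might be linked and queried for the four tasks named above. It is an illustration under stated assumptions, not the paper's implementation: the class names, field names, query methods, and the sample characters and panels are all hypothetical, and the actual annotation schema is not specified in this section.

```python
from dataclasses import dataclass, field


# Hypothetical node types; the paper's real schema may differ.
@dataclass
class Panel:
    panel_id: str
    order: int                                     # reading order
    characters: list = field(default_factory=list)
    actions: list = field(default_factory=list)    # (character, action) pairs
    dialogue: list = field(default_factory=list)   # (speaker, text) pairs


@dataclass
class Event:
    event_id: str
    label: str
    panel_ids: list                                # cross-level links to panels


@dataclass
class MacroEvent:
    macro_id: str
    label: str
    event_ids: list                                # cross-level links to events


class NarrativeGraph:
    """Toy three-level graph with purely symbolic lookups."""

    def __init__(self, panels, events, macro_events):
        self.panels = {p.panel_id: p for p in panels}
        self.events = {e.event_id: e for e in events}
        self.macro_events = {m.macro_id: m for m in macro_events}

    def retrieve_action(self, action):
        # Action retrieval: panels in which any character performs the action.
        hits = [p for p in self.panels.values()
                if any(a == action for _, a in p.actions)]
        return [p.panel_id for p in sorted(hits, key=lambda p: p.order)]

    def trace_dialogue(self, speaker):
        # Dialogue tracing: a speaker's lines in reading order.
        lines = []
        for p in sorted(self.panels.values(), key=lambda p: p.order):
            lines += [(p.panel_id, text) for s, text in p.dialogue if s == speaker]
        return lines

    def character_events(self, character):
        # Appearance mapping: events whose panels contain the character.
        return [e.event_id for e in self.events.values()
                if any(character in self.panels[pid].characters
                       for pid in e.panel_ids)]

    def timeline(self, macro_id):
        # Timeline reconstruction: panels under a macro-event, in reading order.
        pids = [pid
                for eid in self.macro_events[macro_id].event_ids
                for pid in self.events[eid].panel_ids]
        return sorted(pids, key=lambda pid: self.panels[pid].order)


# Invented sample data, deliberately listed out of order.
graph = NarrativeGraph(
    panels=[
        Panel("p2", 2, ["Aya"], [("Aya", "run")], [("Aya", "Wait!")]),
        Panel("p1", 1, ["Aya", "Ken"], [("Ken", "run")], [("Ken", "Catch me!")]),
        Panel("p3", 3, ["Ken"], [("Ken", "fall")], []),
    ],
    events=[Event("e1", "chase", ["p1", "p2"]), Event("e2", "accident", ["p3"])],
    macro_events=[MacroEvent("m1", "pursuit", ["e2", "e1"])],
)

print(graph.retrieve_action("run"))
print(graph.trace_dialogue("Aya"))
print(graph.character_events("Ken"))
print(graph.timeline("m1"))
```

Each query is a plain traversal over explicit links, so every answer can be traced back to the panels and events that produced it. That traceability is the sense in which this kind of symbolic structure is interpretable.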
