SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling
Eileen Wang, Soyeon Caren Han, Josiah Poon
TL;DR
Visual storytelling requires combining social-interaction commonsense with the events depicted in the images. SCO-VIST builds a heterogeneous story graph from per-image captions, BEFORE/AFTER commonsense inferences from Comet-ATOMIC2020, and image themes, then learns edge weights with cosine similarity or PMI and a Temporal Graph Convolution Network, training with $L(\theta) = -\sum_{t=1}^{T} \log p_{\theta}(y_t^* \mid y_1^*, \ldots, y_{t-1}^*)$. The storyline is extracted as the path $P$ that maximizes $\sum_{e \in P} w(e)$, found by negating the edge weights and applying Floyd–Warshall, and is then decoded into a story with a BART-based generator. On the Visual Storytelling (VIST) dataset, SCO-VIST improves visual grounding, coherence, and diversity in both automatic and human evaluations.
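The two edge-weight signals named above, cosine similarity and PMI, can be illustrated with a small self-contained sketch. This is not the authors' implementation: the function names, the toy inputs, and the choice of what counts as a co-occurrence are assumptions made for illustration, and the sketch covers only the two formulas, not the Temporal Graph Convolution Network.

```python
import math

def cosine_similarity(a, b):
    """Semantic weight: cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector carries no similarity signal
    return dot / (norm_a * norm_b)

def pmi(count_xy, count_x, count_y, total):
    """Occurrence weight: log( p(x, y) / (p(x) * p(y)) ) from raw counts.

    Assumption: counts come from some corpus of event pairs; how pairs are
    counted is not specified here.
    """
    if count_xy == 0:
        return float("-inf")  # the two events never co-occur
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))

# Toy usage with hypothetical numbers.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
independent = pmi(count_xy=25, count_x=50, count_y=50, total=100)
associated = pmi(count_xy=10, count_x=10, count_y=10, total=100)
```

A PMI of zero means the two events co-occur exactly as often as independence would predict. Positive values indicate that they appear together more often than chance, which is the signal an occurrence-based edge weight would reward.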
Abstract
Visual storytelling aims to automatically generate a coherent story from a given image sequence. Unlike tasks such as image captioning, visual stories should contain factual descriptions, worldviews, and human social commonsense that tie disjointed elements together into a coherent and engaging human-like story. However, most models focus mainly on factual information and on taxonomic or lexical external knowledge when creating stories. This paper introduces SCO-VIST, a framework that represents the image sequence as a graph of objects and relations, including human action motivation and the associated social interaction commonsense knowledge. SCO-VIST then treats this graph as a set of plot points and bridges them with semantic and occurrence-based edge weights. From this weighted story graph, the storyline is extracted as a sequence of events using the Floyd-Warshall algorithm. The proposed framework produces stories that are superior across multiple metrics of visual grounding, coherence, diversity, and humanness, in both automatic and human evaluations.
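The storyline-extraction step can be sketched as a maximum-weight path search: negate every edge weight, then run Floyd-Warshall as a shortest-path routine. The sketch below is illustrative only and is not the authors' code. The function name, node indices, and weights are hypothetical, and it assumes the story graph is a directed acyclic graph (for example, events ordered by image position), so that negating the weights cannot create a negative cycle.

```python
import math

def best_storyline(num_nodes, edges, source, target):
    """Return (path, total_weight) of the maximum-weight path source -> target.

    edges: dict mapping (u, v) -> weight for a directed ACYCLIC graph.
    Weights are negated so that the shortest path found by Floyd-Warshall
    corresponds to the heaviest path in the original graph.
    """
    dist = [[math.inf] * num_nodes for _ in range(num_nodes)]
    nxt = [[None] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        dist[i][i] = 0.0
    for (u, v), w in edges.items():
        if -w < dist[u][v]:
            dist[u][v] = -w
            nxt[u][v] = v

    # Standard Floyd-Warshall relaxation with next-hop bookkeeping.
    for k in range(num_nodes):
        for i in range(num_nodes):
            for j in range(num_nodes):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]

    if source == target:
        return [source], 0.0
    if nxt[source][target] is None:
        return None, None  # target is unreachable from source

    path = [source]
    while path[-1] != target:
        path.append(nxt[path[-1]][target])
    return path, -dist[source][target]

# Toy story graph: node 0 is the first event, node 4 the last (hypothetical weights).
toy_edges = {
    (0, 1): 1.0, (0, 2): 0.25,
    (1, 2): 0.75, (1, 3): 0.125,
    (2, 3): 0.5, (3, 4): 0.5,
}
path, score = best_storyline(5, toy_edges, source=0, target=4)
```

In this toy graph the heaviest route visits every node in order, because the detour through node 2 adds more weight than the direct edge from node 1 to node 3. Floyd-Warshall runs in cubic time in the number of nodes, which is acceptable for the small graphs a short image sequence would produce.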
