SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling

Eileen Wang, Soyeon Caren Han, Josiah Poon

TL;DR

Visual storytelling requires integrating social-interaction commonsense with the depicted events. SCO-VIST builds a heterogeneous story graph from per-image captions, BEFORE/AFTER commonsense inferences via Comet-ATOMIC2020, and image themes, then learns edge weights with cosine similarity or pointwise mutual information (PMI) and a Temporal Graph Convolution Network, training with $L(\theta) = -\sum_{t=1}^{T} \log p_{\theta}(y_t^*|y_1^*,...,y_{t-1}^*)$. The optimal storyline is extracted as the path $P$ that maximizes the total edge weight $\sum_{e \in P} w(e)$, solved by negating the weights and applying Floyd–Warshall as a shortest-path search, before decoding with a BART-based generator. On the Visual Storytelling (VIST) dataset, SCO-VIST improves visual grounding, coherence, and diversity according to both automatic and human evaluations.

Abstract

Visual storytelling aims to automatically generate a coherent story based on a given image sequence. Unlike tasks like image captioning, visual stories should contain factual descriptions, worldviews, and human social commonsense to put disjointed elements together to form a coherent and engaging human-writeable story. However, most models mainly focus on applying factual information and using taxonomic/lexical external knowledge when attempting to create stories. This paper introduces SCO-VIST, a framework representing the image sequence as a graph with objects and relations that includes human action motivation and its social interaction commonsense knowledge. SCO-VIST then takes this graph representing plot points and creates bridges between plot points with semantic and occurrence-based edge weights. This weighted story graph produces the storyline in a sequence of events using Floyd-Warshall's algorithm. Our proposed framework produces stories superior across multiple metrics in terms of visual grounding, coherence, diversity, and humanness, per both automatic and human evaluations.
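The storyline extraction step can be illustrated with a minimal sketch. It assumes the story graph is acyclic (edges only run forward in causal order), so negating the weights cannot create a negative cycle and a shortest path in the negated graph is a maximum-weight path in the original. The function name `longest_path`, the integer node ids, and the toy weights are hypothetical and are not taken from the paper.

```python
import math


def longest_path(n, edges, source, target):
    """Maximum-weight path via Floyd-Warshall on negated weights.

    n      : number of nodes, labelled 0..n-1
    edges  : dict mapping (u, v) -> weight for each directed edge
    Assumes the graph is acyclic, so negated weights give no negative cycles.
    Returns (total_weight, path), or (None, []) if target is unreachable.
    """
    dist = [[math.inf] * n for _ in range(n)]
    nxt = [[None] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
    for (u, v), w in edges.items():
        dist[u][v] = -w  # negate: longest path becomes shortest path
        nxt[u][v] = v

    # Standard Floyd-Warshall relaxation over all intermediate nodes k.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]

    if source == target:
        return 0.0, [source]
    if nxt[source][target] is None:
        return None, []

    # Follow the successor table to rebuild the path.
    path = [source]
    while path[-1] != target:
        path.append(nxt[path[-1]][target])
    return -dist[source][target], path


# Toy story graph: node 0 is the left-most node, node 4 the right-most.
demo_edges = {
    (0, 1): 0.75, (0, 2): 0.5,
    (1, 3): 0.25, (2, 3): 0.75,
    (1, 4): 0.125, (3, 4): 0.5,
}
demo_score, demo_path = longest_path(5, demo_edges, 0, 4)
print(demo_path, demo_score)  # [0, 2, 3, 4] 1.75
```

Floyd–Warshall computes all-pairs distances in cubic time, which is more than a single source-to-target query needs, but story graphs over a handful of images are small enough that this is unlikely to matter.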

Paper Structure (24 sections, 3 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: SCO-VIST's proposed framework. In Stage 1, the caption, theme and commonsense nodes are created and connected with causal ordering to form the story graph. In Stage 2, edge weights are assigned using cosine similarity or pointwise mutual information and further refined through graph learning. Stage 3 takes the final story graph, negates the weights and constructs the storyline by finding the shortest path between the leftmost and rightmost nodes. The storyline is then fed to a Transformer for story generation. The corresponding detailed view of the final story graph for this example is depicted in the Story Graph appendix.
  • Figure 2: Count of unique unigrams for different part-of-speech (POS) tags for our proposed SRL-pmi vs. the six state-of-the-art baselines.
  • Figure 3: Generated stories for our SRL-pmi model versus the six baseline models. Blue/red words represent concepts relevant/irrelevant to the image sequence.
  • Figure 4: AREL vs. SRL-pmi for an event-based story (above) and object-based story (below). Blue words indicate concepts implicitly or explicitly used in the generated story while red represents irrelevant concepts. Underlined words in the story represent concepts relevant to the image stream.
  • Figure 5: An example of a storyline and matching story generated using the SRL-pmi approach with different pre-trained image captioning models. Underlined words in the storyline are the image captions and blue words are visually relevant concepts to the image sequence.
  • ...and 3 more figures
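
The two edge-weighting options named in the Figure 1 caption (cosine similarity and pointwise mutual information) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: PMI is estimated here from how often two events co-occur in the same story of a small reference corpus, and cosine similarity is taken over plain numeric vectors. The names `make_pmi`, `cosine`, and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def make_pmi(stories):
    """Build a PMI scorer from co-occurrence counts.

    stories : list of stories, each a list of event strings.
    Probabilities are estimated per story (assumption): p(a) is the share of
    stories containing a, p(a, b) the share containing both a and b.
    """
    n = len(stories)
    single = Counter()
    pair = Counter()
    for events in stories:
        uniq = sorted(set(events))           # count each event once per story
        single.update(uniq)
        pair.update(combinations(uniq, 2))   # unordered pairs, sorted order

    def pmi(a, b):
        key = (a, b) if a <= b else (b, a)
        joint = pair[key]
        if joint == 0:
            return float("-inf")             # never observed together
        return math.log((joint / n) / ((single[a] / n) * (single[b] / n)))

    return pmi


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus of four "stories" made of event labels.
demo_corpus = [
    ["arrive", "eat"],
    ["arrive", "eat"],
    ["leave"],
    ["arrive"],
]
demo_pmi = make_pmi(demo_corpus)
print(demo_pmi("arrive", "eat"))   # positive: co-occur more than chance
print(demo_pmi("arrive", "leave")) # -inf: never co-occur
print(cosine([1.0, 0.0], [0.0, 1.0]))
```

In practice an unobserved pair would likely need smoothing or a floor value rather than negative infinity before the scores are used as edge weights; that choice is left open here.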