TARN-VIST: Topic Aware Reinforcement Network for Visual Storytelling

Weiran Chen, Xin Li, Jiaqi Su, Guiqian Zhu, Ying Li, Yi Ji, Chunping Liu

TL;DR

This paper proposes Topic Aware Reinforcement Network for VIsual StoryTelling (TARN-VIST), a method that pre-extracts the topic information of stories from both visual and linguistic perspectives and applies two topic-consistent reinforcement learning rewards to measure the discrepancy between the generated story and the human-labeled story.

Abstract

As a cross-modal task, visual storytelling aims to automatically generate a story for an ordered image sequence. Unlike image captioning, visual storytelling requires not only modeling the relationships between objects within an image but also mining the connections between adjacent images. Recent approaches primarily use either end-to-end or multi-stage frameworks to generate relevant stories, but they usually overlook latent topic information. In this paper, to generate more coherent and relevant stories, we propose a novel method, Topic Aware Reinforcement Network for VIsual StoryTelling (TARN-VIST). In particular, we pre-extract the topic information of stories from both visual and linguistic perspectives. We then apply two topic-consistent reinforcement learning rewards to identify the discrepancy between the generated story and the human-labeled story, so as to refine the whole generation process. Extensive experimental results on the VIST dataset and human evaluation demonstrate that our proposed model outperforms most of the competitive models across multiple evaluation metrics.

Paper Structure

This paper contains 18 sections, 12 equations, 3 figures, 5 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of TARN-VIST. Image features are extracted by a pre-trained ResNet and fed into a hierarchical decoder, which consists of a manager LSTM and a worker LSTM, to generate a sample story. Once the candidate story is generated, the two topic-consistency rewards are combined to refine the generation process. A classical sentence-level BLEU reward is also used to control the fluency of the generated story.
  • Figure 2: Examples of extracted topic information
  • Figure 3: Example story generated by TARN-VIST and several competitive baselines.
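
The reward combination described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each topic-consistency reward can be approximated by the cosine similarity between topic vectors of the generated and reference stories, and that the three rewards are merged by a weighted sum. The function names, the vector representation, and the weights `w_visual`, `w_language`, and `w_bleu` are all hypothetical. The BLEU score is taken as a precomputed input.

```python
import math
from typing import Sequence


def cosine_similarity(p: Sequence[float], q: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def combined_reward(
    gen_topic_visual: Sequence[float],
    ref_topic_visual: Sequence[float],
    gen_topic_language: Sequence[float],
    ref_topic_language: Sequence[float],
    bleu: float,
    w_visual: float = 1.0,    # hypothetical weight, not taken from the paper
    w_language: float = 1.0,  # hypothetical weight, not taken from the paper
    w_bleu: float = 1.0,      # hypothetical weight, not taken from the paper
) -> float:
    """Weighted sum of two topic-consistency rewards and a sentence-level BLEU reward.

    Assumption: topic consistency is approximated here by cosine similarity
    between topic vectors of the generated and the human-labeled story.
    """
    r_visual = cosine_similarity(gen_topic_visual, ref_topic_visual)
    r_language = cosine_similarity(gen_topic_language, ref_topic_language)
    return w_visual * r_visual + w_language * r_language + w_bleu * bleu


if __name__ == "__main__":
    # Toy topic vectors; real ones would come from the pre-extracted topic information.
    reward = combined_reward([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], bleu=0.5)
    print(reward)
```

In a reinforcement learning setup, a scalar like this would typically scale the log-likelihood of the sampled story, so that stories whose topics match the reference are reinforced. The exact reward definitions and weighting are given in the paper itself.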