Improving Generation and Evaluation of Visual Stories via Semantic Consistency

Adyasha Maharana, Darryl Hannan, Mohit Bansal

TL;DR

The paper tackles the challenge of generating coherent visual stories from sequences of captions by introducing DuCo-StoryGAN, which combines MART-based context encoding, dual learning via video redescription, and a copy-transform mechanism to improve cross-frame consistency and semantic alignment. It extends the StoryGAN framework and demonstrates substantial gains in character fidelity, caption relevance, and global narrative coherence, supported by a diverse automatic evaluation suite and human judgments. The authors provide extensive ablations, qualitative analyses, and linguistic insights, and they introduce Hierarchical DAMSM-based metrics to better capture story-level alignment. They also discuss ethics and broader impacts, acknowledging dataset limitations and the need for more diverse visual storytelling data to generalize beyond cartoon domains.

Abstract

Story visualization is an under-explored task that falls at the intersection of many important research directions in both computer vision and natural language processing. In this task, given a series of natural language captions which compose a story, an agent must generate a sequence of images that correspond to the captions. Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task. However, there is room for improvement of generated images in terms of visual quality, coherence and relevance. We present a number of improvements to prior modeling approaches, including (1) the addition of a dual learning framework that utilizes video captioning to reinforce the semantic alignment between the story and generated images, (2) a copy-transform mechanism for sequentially-consistent story visualization, and (3) MART-based transformers to model complex interactions between frames. We present ablation studies to demonstrate the effect of each of these techniques on the generative power of the model for both individual images as well as the entire narrative. Furthermore, due to the complexity and generative nature of the task, standard evaluation metrics do not accurately reflect performance. Therefore, we also provide an exploration of evaluation metrics for the model, focused on aspects of the generated frames such as the presence/quality of generated characters, the relevance to captions, and the diversity of the generated images. We also present correlation experiments of our proposed automated metrics with human evaluations. Code and data available at: https://github.com/adymaharana/StoryViz
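The copy-transform idea named in the abstract, reusing features from earlier frames so that later frames stay consistent with them, can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`copy_from_previous`, `generate_story`), the dot-product attention, the fixed `mix` weight, and the use of flat feature vectors instead of image feature maps are all assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def copy_from_previous(current, previous_frames, mix=0.5):
    """Blend `current` with an attention-weighted average of earlier frames.

    Hypothetical stand-in for a copy-transform step: attention weights come
    from dot products between the current features and each earlier frame.
    """
    if not previous_frames:
        # First frame: nothing to copy from.
        return list(current)
    scores = [sum(c * p for c, p in zip(current, prev)) for prev in previous_frames]
    weights = softmax(scores)
    copied = [
        sum(w * prev[i] for w, prev in zip(weights, previous_frames))
        for i in range(len(current))
    ]
    # `mix` is an assumed fixed blend weight; a real model would learn this.
    return [(1 - mix) * c + mix * k for c, k in zip(current, copied)]


def generate_story(frame_features):
    """Process frames in order, letting each one copy from those before it."""
    outputs = []
    for feat in frame_features:
        outputs.append(copy_from_previous(feat, outputs))
    return outputs


story = generate_story([[1.0, 0.0], [0.0, 1.0]])
```

In this toy run the second frame ends up halfway between its own features and the first frame's, which is the kind of cross-frame carry-over the mechanism is meant to encourage.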

Paper Structure

This paper contains 44 sections, 9 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Illustration of the Pororo-SV dataset (Captions & Ground Truth) and the corresponding images generated from our model (Generated).
  • Figure 2: Illustration of the DuCo-StoryGAN architecture. The story encoder is used to initialize the memory module in the MART context encoder, which encodes the captions for the image generator. The copy-transform mechanism copies features from the images generated in previous timesteps to the image in the current timestep. The generated images are passed to the story and image discriminators and to the dual learning video captioning model.
  • Figure 3: Sample results from StoryGAN and DuCo-StoryGAN on the unseen test split.
  • Figure 4: Progression of character classification scores (top) and generated images (bottom) with training.
  • Figure 5: Comparative examples of generated images.
  • ...and 3 more figures