Let Storytelling Tell Vivid Stories: An Expressive and Fluent Multimodal Storyteller

Chuanqi Zang, Jiji Tang, Rongsheng Zhang, Zeng Zhao, Tangjie Lv, Mingtao Pei, Wei Liang

TL;DR

This work proposes LLaMS, a pipeline for generating multimodal, human-level stories that are both expressive and consistent. It employs a sequence data auto-enhancement strategy to strengthen factual content expression and a textual reasoning architecture for expressive story generation and prediction.

Abstract

Storytelling aims to generate reasonable and vivid narratives based on an ordered image stream. Fidelity to the theme of the image story and divergence of story plots encourage readers to keep reading. Previous works iteratively improved the alignment of multiple modalities but ultimately produced simplistic storylines for image streams. In this work, we propose a new pipeline, termed LLaMS, to generate multimodal human-level stories characterized by expressiveness and consistency. Specifically, by fully exploiting the commonsense knowledge within the LLM, we first employ a sequence data auto-enhancement strategy to enhance factual content expression and leverage a textual reasoning architecture for expressive story generation and prediction. Second, we propose the SQ-Adapter module for story illustration generation, which maintains sequence consistency. Numerical results are obtained through human evaluation to verify the superiority of the proposed LLaMS. Evaluations show that LLaMS achieves state-of-the-art storytelling performance, with an 86% correlation win rate and a 100% consistency win rate compared with previous SOTA methods. Furthermore, ablation experiments are conducted to verify the effectiveness of the proposed sequence data enhancement and SQ-Adapter.

Paper Structure (26 sections, 7 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: In this work, we unify story generation and story prediction as storytelling and propose a novel pipeline for this task. Given 3 images of a story, our method first generates a vivid story along a reasonable storyline based on the factual events occurring in the images. Then, following this storyline, it imagines and forecasts subsequent story developments, presenting them through textual descriptions and accompanying visual representations. The results exhibit comprehensive story semantics, both in story expressiveness (integrality, interestingness, correlation) and in consistency across multiple story plots.
  • Figure 2: Overview of the Large Language model assisted Multimodal Storyteller (LLaMS) pipeline. Left: the process of sequence data enhancement, from brief storylines to detailed descriptions to highly expressive (integral, interesting, correlated) and consistent stories. Right: the architecture for the storytelling task. During the inference stage, given 1$\sim$5 images of a story, an image-to-text model generates textual stories based on the images and a text-to-image model generates style-consistent plot illustrations.
  • Figure 3: We propose the SQ-Adapter for consistent vision generation. A learned latent query maps high-dimensional inputs of unrestricted length to a fixed dimension, controlling the generated image style. The SQ-Adapter is trainable with only a few parameters, shown in red. The frozen Stable Diffusion module is shown in blue.
  • Figure 4: Qualitative results for the story generation task (given 5 images) and the story prediction task (given 3 images). To save space, we omit the textual results for some samples in the story prediction task; more complete results are shown in the supplementary material.
  • Figure 5: Three example applications of our LLaMS. These examples demonstrate the flexible application of our framework by integrating story generation and story prediction.
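
The Figure 3 caption says a learned latent query reduces inputs of unrestricted length to a fixed dimension. The sketch below illustrates that general idea only and is not the authors' SQ-Adapter. It assumes single-head scaled dot-product cross-attention with no projection matrices, and it uses random vectors as stand-ins for learned queries and for image features. All names and sizes (`latent_query_pool`, `DIM`, `NUM_QUERIES`) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def latent_query_pool(queries, features):
    """Cross-attend fixed latent queries over a variable-length feature list.

    queries:  num_queries vectors of size dim (stand-ins for learned parameters)
    features: any number of vectors of size dim (e.g. per-image embeddings)
    returns:  num_queries vectors of size dim, regardless of len(features)
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    pooled = []
    for q in queries:
        # Scaled dot-product score between this query and every input feature.
        scores = [scale * sum(qi * fi for qi, fi in zip(q, f)) for f in features]
        weights = softmax(scores)
        # Weighted average of the features (they act as both keys and values).
        pooled.append([
            sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)
        ])
    return pooled


def random_vectors(count, dim, rng):
    """Helper for the demo: `count` Gaussian vectors of size `dim`."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM, NUM_QUERIES = 8, 4  # hypothetical sizes, not taken from the paper
queries = random_vectors(NUM_QUERIES, DIM, rng)

short_story = random_vectors(3, DIM, rng)  # stand-in features for 3 images
long_story = random_vectors(5, DIM, rng)   # stand-in features for 5 images

out_short = latent_query_pool(queries, short_story)
out_long = latent_query_pool(queries, long_story)
print(len(out_short), len(out_short[0]))
print(len(out_long), len(out_long[0]))
```

Both calls return `NUM_QUERIES` vectors of size `DIM`, so a downstream module sees the same shape whether the story has 3 or 5 images. In a trained adapter the queries and projection weights would be learned parameters rather than random values.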