StoryImager: A Unified and Efficient Framework for Coherent Story Visualization and Completion

Ming Tao, Bing-Kun Bao, Hao Tang, Yaowei Wang, Changsheng Xu

TL;DR

This work tackles coherent story visualization and completion by addressing limitations of autoregressive frame-by-frame generation, heavy history encoders, and disjoint task training. It introduces StoryImager, a unified framework that uses Storyboard-Gen for bidirectional synthesis, a Target Frame Masking strategy to unify visualization and completion, and a Frame-Story Cross Attention Module along with a Contextual Feature Extractor to ensure local frame fidelity and global story coherence, all powered by a PEFT-enabled diffusion backbone. Experiments on Pororo-SV and Flintstones-SV show that StoryImager achieves superior FID and FSD scores, enables story completion, and reduces hardware and time requirements compared to AR-LDM, with strong human evaluation results confirming improved visual quality, consistency, and relevance. The approach yields a practical, versatile tool for coherent multi-frame storytelling with efficient use of pre-trained diffusion models and parameter-efficient fine-tuning, benefiting applications in entertainment, education, and multimedia storytelling.

Abstract

Story visualization aims to generate a series of realistic and coherent images based on a storyline. Current models adopt a frame-by-frame architecture by converting a pre-trained text-to-image model into an auto-regressive one. Although these models have shown notable progress, three flaws remain. 1) Unidirectional auto-regressive generation restricts usability in many scenarios. 2) The additional story history encoders bring an extremely high computational cost. 3) The story visualization and continuation models are trained and run independently, which is not user-friendly. To address these flaws, we propose a bidirectional, unified, and efficient framework, namely StoryImager. StoryImager enhances the storyboard generative ability inherited from the pre-trained text-to-image model to enable bidirectional generation. Specifically, we introduce a Target Frame Masking Strategy to extend and unify different story image generation tasks. Furthermore, we propose a Frame-Story Cross Attention Module that decomposes the cross attention for local fidelity and global coherence. Moreover, we design a Contextual Feature Extractor to extract contextual information from the whole storyline. Extensive experimental results demonstrate the excellent performance of StoryImager. The code is available at https://github.com/tobran/StoryImager.
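
As a rough illustration of how a single masking scheme could cover several tasks, the sketch below builds a per-frame binary mask in plain Python. It is not taken from the StoryImager code base: the function names `target_frame_mask` and `hide_targets`, the convention that 1 marks a frame to generate and 0 marks a frame supplied as context, and the 5-frame story length are all assumptions made for this example.

```python
from typing import List, Optional, Sequence


def target_frame_mask(num_frames: int, known_frames: Sequence[int] = ()) -> List[int]:
    """Return 1 for frames to generate (targets) and 0 for frames given as context.

    Hypothetical helper: the 1/0 convention is an assumption of this sketch.
    """
    known = set(known_frames)
    for idx in known:
        if not 0 <= idx < num_frames:
            raise ValueError(f"frame index {idx} is outside 0..{num_frames - 1}")
    return [0 if i in known else 1 for i in range(num_frames)]


def hide_targets(frames: Sequence[str], mask: Sequence[int]) -> List[Optional[str]]:
    """Replace target frames with None so only context frames stay visible."""
    return [None if m else f for f, m in zip(frames, mask)]


# Story visualization: no frame is known, so every frame is a target.
visualization = target_frame_mask(5)

# Story continuation: the first frame is given, the rest are generated.
continuation = target_frame_mask(5, known_frames=[0])

# Story completion: the first and last frames are given, the middle is generated.
completion = target_frame_mask(5, known_frames=[0, 4])

# Applying the completion mask to placeholder frame names.
visible = hide_targets(["f0", "f1", "f2", "f3", "f4"], completion)
```

Under these assumptions, the three tasks differ only in which frame indices are passed as `known_frames`, which is one way to read the claim that masking unifies them in one model.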

Paper Structure

This paper contains 18 sections, 2 equations, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: (a) Existing models adopt an auto-regressive generative approach, which restricts usability in many scenarios, and users need to switch between models to meet their current requirements. (b) Our proposed StoryImager unifies different tasks into one model, which handles a wider range of generative requirements.
  • Figure 2: (a) Existing models introduce large models to encode the history information for auto-regressive generation. (b) The storyboard generative ability that Stable Diffusion (Rombach et al., 2022) learns during pretraining. (c) Our proposed StoryImager inherits the storyboard generative ability and unifies different tasks through a masking strategy.
  • Figure 3: The architecture of StoryImager for story visualization and completion. Our StoryImager adopts a Storyboard-based Generation approach to enable bidirectional story image generation. It unifies different tasks through the Target Frame Masking Strategy.
  • Figure 4: (a) The architecture of Frame-Story Cross Attention Module. It decomposes the cross-attention module into story-level and frame-level cross-attention to enable local image fidelity and global story coherence. (b) The proposed Contextual Feature Extractor summarizes the whole text information, extracts global contextual information, and predicts the frame-aware latent prior for the U-Net.
  • Figure 5: Comparison of story visualization results between Make-A-Story, AR-LDM, and our proposed StoryImager on Flintstones-SV and Pororo-SV datasets.
  • ...and 2 more figures
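
The Figure 4(a) caption describes splitting cross-attention into a frame-level and a story-level branch. The following is a minimal, dependency-free sketch of that idea, not the authors' implementation: it omits learned query/key/value projections and multiple heads, uses the text tokens directly as keys and values, and combines the two branches by simple addition. The function names and the additive combination are assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, context: Matrix) -> Matrix:
    """Scaled dot-product attention with keys = values = context tokens."""
    dim = len(context[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in context]
        weights = softmax(scores)
        output.append([sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)])
    return output


def frame_story_cross_attention(frame_queries: List[Matrix], frame_texts: List[Matrix]) -> List[Matrix]:
    """Assumed decomposition: per-frame (local) plus whole-story (global) attention."""
    # Story-level context: the text tokens of every frame, concatenated.
    story_text = [token for text in frame_texts for token in text]
    outputs = []
    for queries, own_text in zip(frame_queries, frame_texts):
        local = cross_attention(queries, own_text)      # frame-level: own caption only
        shared = cross_attention(queries, story_text)   # story-level: all captions
        # Assumption of this sketch: the two branches are summed.
        outputs.append([[l + s for l, s in zip(lr, sr)] for lr, sr in zip(local, shared)])
    return outputs


# Two frames, one image token and one text token each, feature dimension 2.
queries = [[[1.0, 0.0]], [[0.0, 1.0]]]
texts = [[[1.0, 0.0]], [[0.0, 1.0]]]
result = frame_story_cross_attention(queries, texts)
```

In this toy setup the frame-level branch can only see a frame's own caption, while the story-level branch mixes in the other frame's caption, which is the intuition behind pairing local fidelity with global coherence.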