StoryImager: A Unified and Efficient Framework for Coherent Story Visualization and Completion
Ming Tao, Bing-Kun Bao, Hao Tang, Yaowei Wang, Changsheng Xu
TL;DR
This work tackles coherent story visualization and completion by addressing three limitations of prior methods: autoregressive frame-by-frame generation, heavy history encoders, and disjoint task training. It introduces StoryImager, a unified framework built on a PEFT-enabled diffusion backbone. StoryImager uses Storyboard-Gen for bidirectional synthesis, a Target Frame Masking strategy to unify visualization and completion, and a Frame-Story Cross Attention Module together with a Contextual Feature Extractor to ensure local frame fidelity and global story coherence. Experiments on Pororo-SV and Flintstones-SV show that StoryImager achieves superior FID and FSD scores, enables story completion, and reduces hardware and time requirements compared to AR-LDM. Human evaluation confirms improved visual quality, consistency, and relevance. The result is a practical, versatile tool for coherent multi-frame storytelling that makes efficient use of pre-trained diffusion models and parameter-efficient fine-tuning, with applications in entertainment, education, and multimedia storytelling.
Abstract
Story visualization aims to generate a series of realistic and coherent images based on a storyline. Current models adopt a frame-by-frame architecture by transforming the pre-trained text-to-image model into an auto-regressive one. Although these models have shown notable progress, three flaws remain. 1) The unidirectional nature of auto-regressive generation restricts usability in many scenarios. 2) The additional story history encoders incur an extremely high computational cost. 3) The story visualization and continuation models are trained and run at inference independently, which is not user-friendly. To this end, we propose a bidirectional, unified, and efficient framework, namely StoryImager. StoryImager enhances the storyboard generative ability inherited from the pre-trained text-to-image model to enable bidirectional generation. Specifically, we introduce a Target Frame Masking Strategy to extend and unify different story image generation tasks. Furthermore, we propose a Frame-Story Cross Attention Module that decomposes the cross attention for local fidelity and global coherence. Moreover, we design a Contextual Feature Extractor to extract contextual information from the whole storyline. Extensive experimental results demonstrate the excellent performance of StoryImager. The code is available at https://github.com/tobran/StoryImager.
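
To make the idea of unifying tasks through target frame masking more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a story is a fixed-length sequence of frames and that each task differs only in which frames are marked as targets to generate. The function name `target_frame_mask`, the task labels, and the choice that continuation keeps the first frame as given are all assumptions made for illustration.

```python
import random


def target_frame_mask(num_frames, task, known=None, rng=None):
    """Return a list of booleans; True marks a frame the model must generate.

    Hypothetical helper: the task names and conventions are assumptions,
    not taken from the StoryImager code base.
    """
    if task == "visualization":
        # No frames are given, so every frame is a target.
        return [True] * num_frames
    if task == "continuation":
        # Assumption: the first frame is given and the rest are generated.
        return [i != 0 for i in range(num_frames)]
    if task == "completion":
        # The caller lists the indices of frames that are already known.
        known_set = set(known or [])
        return [i not in known_set for i in range(num_frames)]
    if task == "random":
        # Training-time sketch: mask a random non-empty subset of frames.
        rng = rng or random.Random()
        k = rng.randint(1, num_frames)
        targets = set(rng.sample(range(num_frames), k))
        return [i in targets for i in range(num_frames)]
    raise ValueError(f"unknown task: {task}")
```

Under these assumptions, a single model can serve every task by receiving the same frame sequence with a different mask: `target_frame_mask(5, "completion", known=[0, 4])` marks only the three middle frames as targets, while `"visualization"` marks all five.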
