ContextualStory: Consistent Visual Storytelling with Spatially-Enhanced and Storyline Context
Sixiao Zheng, Yanwei Fu
TL;DR
ContextualStory addresses the high memory use, slow generation, and limited context integration of autoregressive visual storytelling. It introduces three components within a diffusion-based framework: Spatially-Enhanced Temporal Attention (SETA), a Storyline Contextualizer (SC), and a StoryFlow Adapter. The non-autoregressive design generates story frames without conditioning each one on previously generated frames, which enables efficient, globally coherent story generation and continuation with explicit handling of character movement and scene changes. Extensive experiments on PororoSV and FlintstonesSV show state-of-the-art performance on both story visualization and story continuation, along with favorable memory and speed characteristics. The approach combines spatial-temporal modeling with storyline-aware context propagation to produce coherent multi-frame narratives.
Abstract
Visual storytelling involves generating a sequence of coherent frames from a textual storyline while maintaining consistency in characters and scenes. Existing autoregressive methods, which rely on previous frame-sentence pairs, suffer from high memory usage, slow generation, and limited context integration. To address these issues, we propose ContextualStory, a novel framework for generating coherent story frames and for extending an existing story with new frames. ContextualStory uses Spatially-Enhanced Temporal Attention to capture spatial and temporal dependencies, which lets it handle large character movements effectively. We also introduce a Storyline Contextualizer, which enriches the context in the storyline embedding, and a StoryFlow Adapter, which measures scene changes between frames to guide the model. Extensive experiments on the PororoSV and FlintstonesSV datasets demonstrate that ContextualStory significantly outperforms existing state-of-the-art methods in both story visualization and story continuation. Code is available at https://github.com/sixiaozheng/ContextualStory.
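To make two of the ideas above concrete, the following is a minimal NumPy sketch under stated assumptions, not the authors' implementation. It shows (1) plain temporal self-attention, in which every spatial position attends across all frames of a story at once rather than frame by frame, and (2) a simple frame-to-frame difference as one possible proxy for scene change. The function names `temporal_self_attention` and `scene_change`, the `(T, N, C)` tensor layout, the use of identity projections in place of learned query/key/value weights, and the mean-absolute-difference measure are all illustrative assumptions. The spatial enhancement in SETA and the actual StoryFlow Adapter are not reproduced here.

```python
import numpy as np


def temporal_self_attention(frames: np.ndarray) -> np.ndarray:
    """Generic temporal self-attention (assumption: no learned projections).

    frames has shape (T, N, C): T frames, N spatial positions, C channels.
    Each spatial position attends over the T frames jointly, so no frame
    is generated conditioned on a previously generated one.
    """
    T, N, C = frames.shape
    x = frames.transpose(1, 0, 2)                    # (N, T, C)
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(C)   # (N, T, T)
    # Numerically stable softmax over the last (frame) axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = weights @ x                                # (N, T, C)
    return out.transpose(1, 0, 2)                    # back to (T, N, C)


def scene_change(frames: np.ndarray) -> np.ndarray:
    """Hypothetical scene-change proxy: mean absolute difference between
    consecutive frames. Returns an array of shape (T - 1,)."""
    return np.abs(frames[1:] - frames[:-1]).mean(axis=(1, 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    story = rng.normal(size=(5, 16, 8))  # 5 frames, 16 positions, 8 channels
    attended = temporal_self_attention(story)
    print(attended.shape)                # shape is preserved: (5, 16, 8)
    print(scene_change(story).shape)     # one value per frame transition: (4,)
```

In this sketch, a story whose frames are all identical yields uniform attention weights and a scene-change score of zero for every transition, which is a quick sanity check on both functions.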
