ContextualStory: Consistent Visual Storytelling with Spatially-Enhanced and Storyline Context

Sixiao Zheng, Yanwei Fu

TL;DR

ContextualStory addresses the memory and context limitations of autoregressive visual storytelling with three components in a diffusion-based framework: Spatially-Enhanced Temporal Attention (SETA), a Storyline Contextualizer (SC), and a StoryFlow Adapter. Its non-autoregressive design generates and continues story frames efficiently and with global coherence, explicitly handling character movement and scene changes. Experiments on PororoSV and FlintstonesSV show state-of-the-art performance for both story visualization and continuation, along with favorable memory and speed characteristics. By combining spatial-temporal modeling with storyline-aware context propagation, the work offers a scalable path toward coherent multi-frame narratives.

Abstract

Visual storytelling involves generating a sequence of coherent frames from a textual storyline while maintaining consistency in characters and scenes. Existing autoregressive methods, which rely on previous frame-sentence pairs, struggle with high memory usage, slow generation speeds, and limited context integration. To address these issues, we propose ContextualStory, a novel framework designed to generate coherent story frames and extend frames for visual storytelling. ContextualStory utilizes Spatially-Enhanced Temporal Attention to capture spatial and temporal dependencies, handling significant character movements effectively. Additionally, we introduce a Storyline Contextualizer to enrich context in storyline embedding, and a StoryFlow Adapter to measure scene changes between frames for guiding the model. Extensive experiments on PororoSV and FlintstonesSV datasets demonstrate that ContextualStory significantly outperforms existing SOTA methods in both story visualization and continuation. Code is available at https://github.com/sixiaozheng/ContextualStory.
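
The abstract describes the StoryFlow Adapter as measuring scene changes between frames. As a rough illustration of that idea only, the sketch below computes differences between adjacent frames and reduces each one to a single change score. The frame format (nested lists of grayscale values) and the function names `adjacent_frame_differences` and `scene_change_scores` are assumptions made for this example; the representation the actual adapter works on, and how its output guides the model, are not specified in this summary.

```python
from typing import List

# Hypothetical frame format: one grayscale frame as rows of pixel values.
Frame = List[List[float]]


def adjacent_frame_differences(frames: List[Frame]) -> List[Frame]:
    """Return per-pixel differences frames[t + 1] - frames[t] for each adjacent pair."""
    diffs = []
    for prev, nxt in zip(frames, frames[1:]):
        diffs.append(
            [[b - a for a, b in zip(row_prev, row_next)]
             for row_prev, row_next in zip(prev, nxt)]
        )
    return diffs


def scene_change_scores(frames: List[Frame]) -> List[float]:
    """Mean absolute difference per adjacent pair; larger means a bigger change."""
    scores = []
    for diff in adjacent_frame_differences(frames):
        values = [abs(v) for row in diff for v in row]
        scores.append(sum(values) / len(values))
    return scores


# Toy story of three 2x2 frames: frames 0 and 1 are identical, frame 2 differs.
frames = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 1.0], [0.0, 0.0]],
]
scores = scene_change_scores(frames)
```

A story of N frames yields N - 1 scores, one per adjacent pair, so a signal of this kind describes transitions rather than individual frames.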

Paper Structure

This paper contains 23 sections, 3 equations, 15 figures, 12 tables.

Figures (15)

  • Figure 1: Story frames generated by our ContextualStory on PororoSV dataset. Red circles highlight character inconsistencies, and blue circles indicate repeated characters. SETA and SC enhance character consistency and scene coherence, achieving superior results compared to AR-LDM.
  • Figure 2: Architecture of ContextualStory for Story Visualization. Each UNet block includes temporal convolution and Spatially-Enhanced Temporal Attention to effectively capture complex spatial and temporal dependencies. The Storyline Contextualizer enriches the storyline embedding by integrating context information from all text embeddings, while the StoryFlow Adapter measures scene changes by computing differences between adjacent frames.
  • Figure 3: Spatially-Enhanced Temporal Attention leverages a local window mechanism across frames to capture both spatial and temporal dependencies, effectively handling significant character movements.
  • Figure 4: Architecture of ContextualStory for Story Continuation. The first frame latent is used as additional input for all UNet blocks, resized and adjusted with a $1 \times 1$ convolution layer before concatenation with the hidden state.
  • Figure 5: Qualitative comparison of story visualization on PororoSV (left) and FlintstonesSV (right).
  • ...and 10 more figures
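
The Figure 3 caption describes SETA as a local window mechanism across frames. A minimal sketch of that idea, under stated assumptions, is given below: every spatial position attends to a small window around the same location in every frame. This is not the authors' implementation. It uses a single head, takes the raw features as queries, keys, and values (no learned projections), and the function name `local_window_temporal_attention` and the `window` parameter are hypothetical.

```python
import math
from typing import List

# Hypothetical layout: features[t][y][x] is the channel vector of frame t at (y, x).
Features = List[List[List[List[float]]]]


def local_window_temporal_attention(features: Features, window: int = 3) -> Features:
    """Each position attends to a window x window neighborhood in every frame.

    Illustrative only: single head, raw features used as queries, keys and values.
    """
    T, H, W = len(features), len(features[0]), len(features[0][0])
    C = len(features[0][0][0])
    r = window // 2
    out: Features = [[[[0.0] * C for _ in range(W)] for _ in range(H)] for _ in range(T)]
    for t in range(T):
        for y in range(H):
            for x in range(W):
                q = features[t][y][x]
                # Gather keys/values from the local window (clipped at borders) in all frames.
                neighbors = [
                    features[s][yy][xx]
                    for s in range(T)
                    for yy in range(max(0, y - r), min(H, y + r + 1))
                    for xx in range(max(0, x - r), min(W, x + r + 1))
                ]
                # Scaled dot-product logits, then a numerically stable softmax.
                logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(C) for k in neighbors]
                top = max(logits)
                weights = [math.exp(v - top) for v in logits]
                total = sum(weights)
                out[t][y][x] = [
                    sum(w * v[c] for w, v in zip(weights, neighbors)) / total
                    for c in range(C)
                ]
    return out


# Toy input: 2 frames of 2x2 positions with 2 channels each, all identical.
toy = [[[[1.0, 2.0] for _ in range(2)] for _ in range(2)] for _ in range(2)]
toy_out = local_window_temporal_attention(toy, window=3)
```

With `window=1` the sketch reduces to plain temporal attention at a fixed location. A larger window lets a position match content that has shifted by a few positions between frames, which is the intuition behind handling large character movements.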