StoryBlender: Inter-Shot Consistent and Editable 3D Storyboard with Spatial-temporal Dynamics

Bingliang Li, Zhenhong Sun, Jiaming Bian, Yuehao Wu, Yifu Wang, Hongdong Li, Yatao Bian, Huadong Mo, Daoyi Dong

Abstract

Storyboarding is a core skill in visual storytelling for film, animation, and games. However, automating this process requires a system to achieve two properties that current approaches rarely satisfy simultaneously: inter-shot consistency and explicit editability. While 2D diffusion-based generators produce vivid imagery, they often suffer from identity drift and limited geometric control; conversely, traditional 3D animation workflows are consistent and editable but require expert-heavy, labor-intensive authoring. We present StoryBlender, a grounded 3D storyboard generation framework governed by a Story-centric Reflection Scheme. At its core is a three-stage pipeline: (1) Semantic-Spatial Grounding, which constructs a continuity memory graph to decouple global assets from shot-specific variables for long-horizon consistency; (2) Canonical Asset Materialization, which instantiates entities in a unified coordinate space to maintain visual identity; and (3) Spatial-Temporal Dynamics, which achieves layout design and cinematic evolution through visual metrics. By orchestrating multiple agents hierarchically within a verification loop, StoryBlender iteratively self-corrects spatial hallucinations via engine-verified feedback. The resulting native 3D scenes support direct, precise editing of cameras and visual assets while preserving multi-shot continuity. Experiments demonstrate that StoryBlender significantly improves consistency and editability over both diffusion-based and 3D-grounded baselines. Code, data, and a demonstration video will be available at https://engineeringai-lab.github.io/StoryBlender/
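The continuity memory graph described above can be pictured as a store that keeps canonical asset definitions global while per-shot records hold only shot-specific variables. The sketch below is a minimal, hypothetical illustration of that decoupling (the class and field names are assumptions, not the paper's actual data structures): registering an asset once and rejecting shot states that reference unregistered entities is one simple way to prevent identity drift at the shot level.

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    """Global entity: its canonical appearance is fixed across all shots."""
    name: str
    description: str

@dataclass
class ShotState:
    """Shot-specific variables for one asset in one shot."""
    asset: str              # reference into the global asset table
    position: tuple         # placement in the shot's coordinate frame
    pose: str = "neutral"

@dataclass
class ContinuityMemoryGraph:
    """Decouples global assets (shared) from per-shot states (local)."""
    assets: dict = field(default_factory=dict)   # name -> Asset
    shots: list = field(default_factory=list)    # list of [ShotState, ...]

    def add_asset(self, asset: Asset) -> None:
        self.assets[asset.name] = asset

    def add_shot(self, states: list) -> None:
        # Every shot-level reference must resolve to a registered asset,
        # so new identities cannot be hallucinated mid-sequence.
        for s in states:
            if s.asset not in self.assets:
                raise KeyError(f"unknown asset: {s.asset}")
        self.shots.append(states)

g = ContinuityMemoryGraph()
g.add_asset(Asset("Rick", "man in a white dinner jacket"))
g.add_shot([ShotState("Rick", (0.0, 0.0, 0.0))])
g.add_shot([ShotState("Rick", (2.0, 0.0, 1.0), pose="seated")])
print(len(g.assets), len(g.shots))  # 1 2
```

Under this split, editing a shot (camera, pose, placement) never touches the global asset record, while editing an asset's canonical description propagates to every shot that references it.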

Paper Structure

This paper contains 43 sections, 15 equations, 22 figures, and 5 tables.

Figures (22)

  • Figure 1: Overview of StoryBlender compared to existing storyboarding methods. Left: diffusion-based generation in pixel space; improving consistency typically requires reference inputs. Middle: traditional 3D workflow; strong control but a complex, labor-intensive pipeline. Right (ours): StoryBlender uses a hierarchical multi-agent planning framework to create consistent, editable 3D storyboards across shots.
  • Figure 2: Hierarchical Multi-Agent Planning Framework. Governed by a Story-centric Reflection Scheme (b), our system utilizes iterative feedback from 3D engines (e.g., Blender) and Vision-Language Models to ensure geometric and narrative consistency. We translate narrative $\mathcal{T}_{story}$ into 3D storyboards $\mathcal{V}_{3D}$ via a three-stage pipeline: (a) Semantic-Spatial Grounding, where the Director Agent decomposes the story into a structured continuity memory graph ($\bm{\mathcal{G}_{cm}}$) to ensure precise information flow to downstream agents; (c) Canonical Asset Materialization, which instantiates entities from $\bm{\mathcal{G}_{cm}}$ to maintain global asset consistency; and (d) Spatial-Temporal Dynamics, which performs spatial layout of assets from memory and enhances cinematic visual effects. (Details of all agents are provided in Appendix.)
  • Figure 3: Comparison of baselines on a complex multi-shot sequence from the film Casablanca. StoryBlender demonstrates stronger geometric consistency and entity management across shots, maintaining the architectural layout and correct character count in each frame. In contrast, StoryDiffusion and Story2Board capture the general semantic atmosphere but exhibit spatial inconsistencies, hallucinating background changes between camera cuts and failing to preserve the correct number of characters. (More Stories and shots are presented in the Appendix.)
  • Figure 4: Impact of Story-centric Reflection for the Concept Artist and Visual Effects Artist Agents.
  • Figure 5: Spatial error over the 5 reflection turns for naive and physical methods. D: Direction, R: Relationship, O: Occlusion, C: Contact.
  • ...and 17 more figures