Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning

Zhengjian Yao, Yongzhi Li, Xinyuan Gao, Quan Chen, Peng Jiang, Yanye Lu

TL;DR

Extensive experiments across three distinct scenarios (controllable multi-scene generation, autonomous storytelling, and e-commerce advertising) demonstrate the superiority of the Narrative Weaver method and open new possibilities for AI-driven content creation.

Abstract

We present "Narrative Weaver", a novel framework that addresses a fundamental challenge in generative AI: achieving multi-modal controllable, long-range, and consistent visual content generation. While existing models excel at generating high-fidelity short-form visual content, they struggle to maintain narrative coherence and visual consistency across extended sequences - a critical limitation for real-world applications such as filmmaking and e-commerce advertising. Narrative Weaver introduces the first holistic solution that seamlessly integrates three essential capabilities: fine-grained control, automatic narrative planning, and long-range coherence. Our architecture combines a Multimodal Large Language Model (MLLM) for high-level narrative planning with a novel fine-grained control module featuring a dynamic Memory Bank that prevents visual drift. To enable practical deployment, we develop a progressive, multi-stage training strategy that efficiently leverages existing pre-trained models, achieving state-of-the-art performance even with limited training data. Recognizing the absence of suitable evaluation benchmarks, we construct and release the E-commerce Advertising Video Storyboard Dataset (EAVSD) - the first comprehensive dataset for this task, containing over 330K high-quality images with rich narrative annotations. Through extensive experiments across three distinct scenarios (controllable multi-scene generation, autonomous storytelling, and e-commerce advertising), we demonstrate our method's superiority while opening new possibilities for AI-driven content creation.

Paper Structure

This paper contains 24 sections, 6 equations, 16 figures, and 6 tables.

Figures (16)

  • Figure 1: Narrative Weaver Overview. (a) Narrative Weaver Framework: This system utilizes a hybrid design that integrates Autoregressive (AR) and Diffusion models. The bottom panel illustrates a Multimodal Large Language Model (MLLM) acting as the AR model, responsible for generating narrative plans in textual form and encoding historical information into learnable queries. During the diffusion generation stage, a dynamic memory bank encodes initial conditions and prior outputs to prevent visual content drift. (b) Memory Bank: We employ a series-based decay of prior visual feature length to ensure a bounded total memory length. (c) Attention Mask: A specially designed Attention Mask ensures efficient training, where gray areas are ignored during processing.
  • Figure 2: Qualitative results of consistent visual generation. (a) Narrative Weaver produces visually coherent frames that preserve both stylistic and semantic alignment with the given prompts, while effectively advancing the cinematic story progression. (b) Our model maintains environmental consistency conditioned on the input image and achieves more natural visual transitions compared to other methods.
  • Figure 3: Flux.1-Kontext tends to exhibit “copy–paste” behavior when it fails to interpret instructions, resulting in a misleading appearance of high consistency.
  • Figure 4: User Study Results: Model Preference Distribution. The results were aggregated from over 180 responses, each representing a user's selection of the most preferred output.
  • Figure 5: Qualitative results of autonomous narrative planning. Narrative Weaver demonstrates a dual capability: maintaining robust visual consistency while also employing fundamental cinematic techniques. The figure showcases examples where the model autonomously plans and generates contextually appropriate subsequent shots that adhere to standard conventions, including cut-ins for detail, cross-cuts for parallel action, and so on.
  • ...and 11 more figures
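
The Figure 1(b) caption mentions a series-based decay of prior visual feature length that keeps the total memory length bounded, but it does not say which series is used. As a purely illustrative sketch, assuming a geometric decay, the per-output token budgets and their upper bound could be computed as follows. The function names, the base length of 256, and the ratio of 0.5 are hypothetical and do not come from the paper.

```python
def memory_budgets(num_prior: int, base_len: int = 256, ratio: float = 0.5) -> list[int]:
    """Token budget for each prior output, most recent first.

    Assumption: budgets follow base_len * ratio**k (geometric decay).
    The caption only says "series-based decay"; the actual schedule may differ.
    """
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be in (0, 1) for the series to converge")
    # Older outputs (larger k) keep fewer feature tokens; int() floors each budget.
    return [int(base_len * ratio ** k) for k in range(num_prior)]


def total_bound(base_len: int = 256, ratio: float = 0.5) -> float:
    """Upper bound on total memory length: the sum of the infinite geometric series."""
    return base_len / (1.0 - ratio)


if __name__ == "__main__":
    budgets = memory_budgets(4)
    print(budgets, sum(budgets), total_bound())
```

Under this assumption the total memory length never exceeds `base_len / (1 - ratio)`, no matter how many prior outputs are stored, which is one way a bounded memory could be obtained.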