LogiStory: A Logic-Aware Framework for Multi-Image Story Visualization

Chutian Meng, Fan Ma, Chi Zhang, Jiaxu Miao, Yi Yang, Yueting Zhuang

Abstract

Generating coherent and communicative visual sequences, such as image sequences and videos, remains a significant challenge for current multimodal systems. Despite advances in visual quality and the integration of world knowledge, existing models still struggle to maintain logical flow, often resulting in disjointed actions, fragmented narratives, and unclear storylines. We attribute these issues to the lack of attention to visual logic, a critical yet underexplored dimension of visual sequence generation that we define as the perceptual and causal coherence among characters, actions, and scenes over time. To bridge this gap, we propose LogiStory, a logic-aware multi-image story visualization framework whose central innovation is to model visual logic explicitly. To realize this idea, we design a multi-agent system that grounds roles, extracts causal chains, and verifies story-level consistency, transforming narrative coherence from an implicit byproduct of image generation into an explicit modeling objective. This design bridges structured story planning with visual generation, enhancing both narrative clarity and visual quality in story visualization. Furthermore, to evaluate this generation capacity, we construct LogicTale, a benchmark of richly annotated stories that emphasizes causal reasoning and visual logic interpretability. We establish comprehensive automatic and human evaluation protocols designed to measure both visual logic and perceptual quality. Experiments demonstrate that our approach significantly improves the narrative logic of generated visual stories. This work provides a foundational step towards modeling and enforcing visual logic in general image sequence and video generation tasks.

Paper Structure

This paper contains 68 sections, 7 equations, 13 figures, 7 tables, 1 algorithm.

Figures (13)

  • Figure 1: Comparison of the state-of-the-art multimodal models alongside our proposed approach LogiStory in the visualization of the simple story "The Crow and the Pitcher." The results highlight the challenges of visual reasoning in the process of visual sequence generation, while demonstrating the effectiveness of LogiStory.
  • Figure 2: Overview of the LogiStory framework. Given an input story, our system first applies a multi-agent story planner to decompose the story into structured panels with detailed scripts. During generation, the Local Causal Monitor simulates a reader’s linear understanding by evaluating each frame for inconsistencies and generating refinement signals. Then, the Global Causal Verifier applies the causal graph to produce concrete refinement instructions to correct errors and maintain narrative flow.
  • Figure 3: Quality analysis of generated stories across different methods. Representative key scenes are shown here. Complete sequences can be found in the Appendix.
  • Figure 4: Ablation study on key components of LogiStory. Qualitative comparisons on representative examples demonstrate the impact of each module.
  • Figure 5: Quality analysis on Complex Case. Memory Lasts: An old painter takes his student to a distant sunflower field, once shared with his younger brother before the war. As they paint, the narrative shifts briefly to a childhood flashback: two boys painting under the sunflower sky, one waving goodbye. Upon returning home, the student donates a commemorative painting to a local school, where it remains as the old painter’s lasting memory.
  • ...and 8 more figures