Directing the Narrative: A Finetuning Method for Controlling Coherence and Style in Story Generation

Jianzhang Zhang, Yijing Tian, Jiwang Qu, Chuang Liu

Abstract

Story visualization requires generating sequential imagery that aligns semantically with evolving narratives while maintaining rigorous consistency in character identity and visual style. However, existing methodologies often struggle with subject inconsistency and identity drift, particularly when depicting complex interactions or extended narrative arcs. To address these challenges, we propose a cohesive two-stage framework designed for robust and consistent story generation. First, we introduce Group-Shared Attention (GSA), a mechanism that fosters intrinsic consistency by enabling lossless cross-sample information flow within attention layers. This allows the model to structurally encode identity correspondence across frames without relying on external encoders. Second, we leverage Direct Preference Optimization (DPO) to align generated outputs with human aesthetic and narrative standards. Unlike conventional methods that rely on conflicting auxiliary losses, our approach simultaneously enhances visual fidelity and identity preservation by learning from holistic preference data. Extensive evaluations on the ViStoryBench benchmark demonstrate that our method sets a new state of the art, significantly outperforming strong baselines with gains of +10.0 in Character Identity (CIDS) and +18.7 in Style Consistency (CSD), all while preserving high-fidelity generation.
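
As a rough illustration of what "cross-sample information flow within attention layers" could look like, the sketch below lets every frame in a group attend to the keys and values of all frames in that group. This is only one plausible reading of Group-Shared Attention, not the authors' implementation: the function name `group_shared_attention`, the `(G, N, D)` tensor layout, the single attention head, and the absence of learned projections are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def group_shared_attention(q, k, v):
    """Hypothetical single-head attention shared across a group of frames.

    q, k, v: arrays of shape (G, N, D) for G frames, N tokens per frame,
    and feature dimension D. Each frame's queries attend to the keys and
    values of every frame in the group, not only its own.
    """
    g, n, d = q.shape
    # Pool keys and values from all frames into one sequence of G*N tokens.
    k_all = k.reshape(g * n, d)
    v_all = v.reshape(g * n, d)
    scores = q @ k_all.T / np.sqrt(d)      # (G, N, G*N)
    weights = softmax(scores, axis=-1)     # each query's weights sum to 1
    return weights @ v_all                 # (G, N, D)


rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4, 8))
k = rng.normal(size=(3, 4, 8))
v = rng.normal(size=(3, 4, 8))
out = group_shared_attention(q, k, v)
print(out.shape)
```

Because keys and values are pooled over the whole group, reordering the frames that supply `k` and `v` (with the same reordering applied to both) leaves each query's output unchanged, which is the sense in which information is shared across samples in this sketch.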

Paper Structure

This paper contains 49 sections, 8 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Visual showcase of consistent character generation. The leftmost column displays the reference input. Subsequent columns demonstrate our model's capability to synthesize the same character in various complex scenarios, maintaining strictly consistent identity and style.
  • Figure 2: Illustration of the data construction process. The pipeline transforms raw data from storybooks and films into character-consistent sequences.
  • Figure 3: The pipeline of our two-stage approach. Stage 1 performs Consistency Pre-training by injecting Group-Shared Attention (GSA) into the adapter $\Phi^c$. Stage 2 executes Preference Alignment via DPO (adapter $\Phi^d$) to mitigate semantic drift and enhance fidelity.
  • Figure 4: Qualitative comparison on ViStoryBench. Left: A narrative in Flat 2D Illustration style. Right: A narrative in Western Animation style. Unlike baselines that suffer from severe style drift, our method (Bottom Row) faithfully preserves both the artistic style and character identity.
  • Figure 5: User Study Results on a 5-point Likert Scale. We compare human preference scores across Character Consistency (CC), Style Consistency (SC), and Subjective Aesthetics (SA). Our method (Red) consistently outperforms all baselines by a significant margin, particularly in identity preservation.
  • ...and 1 more figure
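
For readers unfamiliar with the preference-alignment stage named in the Figure 3 caption, the sketch below computes the standard DPO loss for a single preference pair: the negative log-sigmoid of the scaled gap between the policy's and the reference model's log-likelihood margins. It is a minimal sketch of the general DPO objective, not the paper's training code. The function name `dpo_loss`, the argument names, and the default `beta=0.1` are assumptions, and how log-likelihoods are obtained for an image generator is not stated in this excerpt, so they are passed in as plain numbers.

```python
import math


def dpo_loss(policy_logp_win, policy_logp_lose,
             ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    All *_logp_* arguments are log-likelihoods of the preferred ("win")
    and rejected ("lose") samples under the trainable policy and the
    frozen reference model. beta (assumed value) scales the margin.
    """
    # How much more the policy favors the preferred sample than the
    # reference model does.
    margin = (policy_logp_win - ref_logp_win) - (policy_logp_lose - ref_logp_lose)
    z = beta * margin
    # -log(sigmoid(z)) == softplus(-z), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# A policy identical to the reference has zero margin.
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# A policy that raises the preferred sample's likelihood gets a lower loss.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
print(neutral, improved)
```

With a zero margin the loss equals log 2, and it decreases as the policy shifts probability toward the preferred sample relative to the reference.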