Consistent Story Generation: Unlocking the Potential of Zigzag Sampling

Mingxiao Li, Mang Ning, Marie-Francine Moens

TL;DR

This work tackles the problem of maintaining consistent subject identity across multi-image visual storytelling with diffusion models. It introduces Asymmetry Zigzag Sampling (AZS), which combines Zig Visual Sharing (ZVS) and Asymmetric Prompt Zigzag Inference (APZI) to inject subject information into latent representations while preserving textual alignment, without fine-tuning. Each sampling step is split into three sub-steps (zig, zag, generation), with asymmetric guidance used to balance identity fidelity against prompt fidelity. The method improves results on both SDXL and FLUX backbones and is also preferred by human raters in a user study. It offers a scalable, training-free path to coherent long-form visual narratives at the cost of higher inference time, and is validated through quantitative, qualitative, and user studies.

Abstract

Text-to-image generation models have made significant progress in producing high-quality images from textual descriptions, yet they continue to struggle with maintaining subject consistency across multiple images, a fundamental requirement for visual storytelling. Existing methods attempt to address this by either fine-tuning models on large-scale story visualization datasets, which is resource-intensive, or by using training-free techniques that share information across generations, which still yield limited success. In this paper, we introduce a novel training-free sampling strategy called Zigzag Sampling with Asymmetric Prompts and Visual Sharing to enhance subject consistency in visual story generation. Our approach proposes a zigzag sampling mechanism that alternates between asymmetric prompting to retain subject characteristics, while a visual sharing module transfers visual cues across generated images to further enforce consistency. Experimental results, based on both quantitative metrics and qualitative evaluations, demonstrate that our method significantly outperforms previous approaches in generating coherent and consistent visual stories. The code is available at https://github.com/Mingxiao-Li/Asymmetry-Zigzag-StoryDiffusion.
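
To make the zig, zag, and generation sub-steps from the TL;DR concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a deterministic DDIM-style update, a hypothetical closed-form noise predictor in which each "prompt" is just a target vector, and guidance scales chosen purely for illustration. The names `alpha_bar`, `predict_noise`, `ddim_move`, and `zigzag_step` are all hypothetical, as is the choice of which prompt drives which sub-step.

```python
import numpy as np

def alpha_bar(t, T):
    # Toy cosine-like schedule (an assumption): close to 1 at t=0, close to 0 at t=T.
    return np.cos(0.5 * np.pi * (t + 1) / (T + 2)) ** 2

def predict_noise(x, t, target, guidance, T):
    # Hypothetical stand-in for a diffusion model with classifier-free guidance.
    # The unconditional branch assumes a clean sample at 0, the conditional one at `target`.
    ab = alpha_bar(t, T)
    eps_uncond = x / np.sqrt(1.0 - ab)
    eps_cond = (x - np.sqrt(ab) * target) / np.sqrt(1.0 - ab)
    return eps_uncond + guidance * (eps_cond - eps_uncond)

def ddim_move(x, t_from, t_to, target, guidance, T):
    # Deterministic DDIM-style move; t_to < t_from denoises, t_to > t_from re-noises.
    eps = predict_noise(x, t_from, target, guidance, T)
    ab_from, ab_to = alpha_bar(t_from, T), alpha_bar(t_to, T)
    x0 = (x - np.sqrt(1.0 - ab_from) * eps) / np.sqrt(ab_from)
    return np.sqrt(ab_to) * x0 + np.sqrt(1.0 - ab_to) * eps

def zigzag_step(x_t, t, zig_target, zag_target, gen_target, w_zig, w_zag, w_gen, T):
    # Zig: denoise one step under strong guidance (here: the identity prompt).
    x_prev = ddim_move(x_t, t, t - 1, zig_target, w_zig, T)
    # Zag: step back to level t under weak guidance, keeping part of the zig signal.
    x_back = ddim_move(x_prev, t - 1, t, zag_target, w_zag, T)
    # Generation: the denoising step that is actually kept (here: the frame prompt).
    return ddim_move(x_back, t, t - 1, gen_target, w_gen, T)

# Toy usage: 4-dimensional "latents" and two made-up prompt targets.
T = 10
rng = np.random.default_rng(0)
x_t = rng.standard_normal(4)
identity_prompt = np.array([1.0, 0.0, 0.0, 0.0])
frame_prompt = np.array([0.0, 1.0, 0.0, 0.0])
x_next = zigzag_step(x_t, 5, identity_prompt, identity_prompt, frame_prompt,
                     w_zig=5.0, w_zag=1.0, w_gen=5.0, T=T)
```

In this toy setting, a zig followed by a zag with the same prompt and the same guidance scale returns exactly the starting latent. Only an asymmetry between the two sub-steps leaves a residual shift toward the zig prompt, which is the intuition behind using asymmetric guidance to inject subject information before the generation sub-step.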

Paper Structure

This paper contains 19 sections, 3 equations, 11 figures, 5 tables, 2 algorithms.

Figures (11)

  • Figure 1: Visual Story Telling requires maintaining subject consistency across a sequence of generated images while ensuring that each image faithfully reflects the corresponding prompt. (Images generated using the FLUX model with our proposed method.)
  • Figure 2: An illustration of Zigzag Sampling, along with a comparison to previous methods in how semantic information is incorporated during image generation.
  • Figure 3: Overview of our proposed pipeline. (a) Identity-guided diffusion inference, where identity prompts are used to cache identity-related visual tokens. (b) Visual token selection module, which leverages attention scores to identify the most relevant tokens for the subject. (c) Illustration of the asymmetric design applied to zigzag sampling. (d) Integration of identity-aware visual information during the zigzag sampling process.
  • Figure 3: Quantitative comparisons of different zigzag sampling designs.
  • Figure 4: Qualitative Results Using the SDXL Backbone. We compare our method with four baselines: ConsiStory, IP-Adapter, StoryDiffusion, and 1Prompt1Story. The identity prompt is shown at the bottom, while individual image prompts are displayed above each corresponding image. Our method demonstrates a strong balance between maintaining subject consistency and adhering to textual prompts. In contrast, the baseline methods often struggle, either failing to preserve the subject’s identity or deviating from the given text descriptions.
  • ...and 6 more figures