Storybooth: Training-free Multi-Subject Consistency for Improved Visual Storytelling

Jaskirat Singh, Junshen Kevin Chen, Jonas Kohler, Michael Cohen

TL;DR

This work tackles training-free, consistent visual storytelling across multiple storyboard frames, where prior cross-frame self-attention methods suffer from inter-character leakage when handling several subjects. It introduces StoryBooth, which fuses region-based storyboard planning with a bounded cross-frame self-attention layer and a cross-frame token-merging module inside a diffusion model, complemented by an early negative token unmerging strategy to boost pose variance. Key contributions include identifying inter-character leakage as the core limitation, proposing region localization plus bounded attention and token merging to address it, and delivering quantitative and qualitative gains in multi-character consistency and text-to-image alignment with significantly faster inference than optimization-based approaches. The approach enables robust, training-free multi-character storytelling applicable to visual storytelling and multi-shot video generation, offering practical speedups and improved coherence across frames.

Abstract

Training-free consistent text-to-image generation depicting the same subjects across different images is a topic of widespread recent interest. Existing works in this direction predominantly rely on cross-frame self-attention, which improves subject consistency by allowing tokens in each frame to attend to tokens in other frames during self-attention computation. While useful for single subjects, we find that it struggles when scaling to multiple characters. In this work, we first analyze the reason for these limitations. Our exploration reveals that the primary issue stems from self-attention leakage, which is exacerbated when trying to ensure consistency across multiple characters. This happens when tokens from one subject attend to other characters, causing them to appear like each other (e.g., a dog appearing like a duck). Motivated by these findings, we propose StoryBooth: a training-free approach for improving multi-character consistency. In particular, we first leverage multi-modal chain-of-thought reasoning and region-based generation to localize the different subjects a priori across the desired story outputs. The final outputs are then generated using a modified diffusion model which consists of two novel layers: 1) a bounded cross-frame self-attention layer for reducing inter-character attention leakage, and 2) a token-merging layer for improving consistency of fine-grained subject details. Through both qualitative and quantitative results, we find that the proposed approach surpasses the prior state of the art, exhibiting improved consistency across both multiple characters and fine-grained subject details.
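The idea of bounding cross-frame self-attention by subject region can be illustrated with a small NumPy sketch. This is not the paper's implementation: the function names, the per-token `frame_ids` and `subject_ids` labels, and the rule that background tokens attend only within their own frame are all assumptions made for illustration. It also shows only plain region masking; the dropout-based bias term mentioned in the figure captions is not reproduced here.

```python
import numpy as np


def bounded_attention_mask(frame_ids, subject_ids):
    """Boolean (n, n) mask; entry [i, j] is True if query token i may attend to key token j.

    Hypothetical labelling: subject_ids > 0 marks a subject region,
    subject_ids == 0 marks background.
    """
    frame_ids = np.asarray(frame_ids)
    subject_ids = np.asarray(subject_ids)
    same_subject = subject_ids[:, None] == subject_ids[None, :]
    same_frame = frame_ids[:, None] == frame_ids[None, :]
    query_is_subject = (subject_ids > 0)[:, None]
    # Subject queries: keys of the same subject in any frame.
    # Background queries (assumption): keys in the same frame only.
    return np.where(query_is_subject, same_subject, same_frame)


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention where disallowed pairs get zero weight."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Every row allows at least its own token, so the row maximum is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Toy storyboard: two frames, each with one dog token (1), one duck token (2),
# and one background token (0), concatenated along the token axis.
frame_ids = [0, 0, 0, 1, 1, 1]
subject_ids = [1, 2, 0, 1, 2, 0]
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))

mask = bounded_attention_mask(frame_ids, subject_ids)
out, weights = masked_attention(q, k, v, mask)
```

In this toy setup the dog token in the first frame can attend to the dog token in the second frame but receives zero weight on either duck token, which is the leakage path the bounded layer is meant to close.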

Paper Structure

This paper contains 10 sections, 7 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Overview. We propose a training- and optimization-free approach for improving consistency across both multiple characters (dog and duck: left) and fine-grained subject details (face of dog: right).
  • Figure 2: Identifying key consistency problems with prior state-of-the-art methods (StoryDiffusion: left; ConsiStory: right). We find that naive cross-frame self-attention suffers from two main limitations: 1) inter-character leakage, which causes the features of different characters to mix (e.g., rabbit and cat, cat and dog), and 2) a lack of fine-grained consistency in subject details (body of bird, wings of duck).
  • Figure 3: Analyzing inter-character leakage (a, b) and the motivation for the proposed approach (c).
  • Figure 4: Method Overview. We propose a training-free approach for improving storyboard consistency across multiple subjects. The core idea of our approach is to combine region-based generation and a novel bounded self-attention layer for reducing inter-character leakage. We first use region-based planning and generation to localize different subjects a priori (Sec. \ref{sec:region-based-generation}). The output images are then generated using a modified diffusion model which consists of 1) a bounded cross-frame self-attention layer (Sec. \ref{sec:intercharacter-leakage}) that limits the self-attention of each token to only the regions of the same subject (e.g., the boy in the figure) across the storyboard, and 2) a cross-frame token-merging layer that uses the attention map from the self-attention layer to align fine-grained subject details (Sec. \ref{sec:token-merging}).
  • Figure 5: Understanding the role of bounded self-attention and dropout. Naive masking of the self-attention tokens using region constraints reduces inter-character leakage at the cost of image quality. To address this, we show that combining naive masking with a dropout-based bias term in the self-attention computation reduces inter-character leakage while preserving image quality.
  • ...and 4 more figures