Object Isolated Attention for Consistent Story Visualization

Xiangyang Luo, Junhao Cheng, Yifan Xie, Xin Zhang, Tao Feng, Zhou Liu, Fei Ma, Fei Yu

TL;DR

The paper tackles open-ended story visualization, where the main challenge is generating coherent image sequences in which each character keeps a consistent identity. It introduces a training-free extended Transformer block with isolated self attention and isolated cross attention, both of which rely on per-character masks and on the layout priors of a pre-trained diffusion model. An LLM-based prompting pipeline and a mask-based isolation mechanism reduce feature leakage between characters and enforce character fidelity across scenes. Experiments on a multi-turn dataset show improvements in character consistency and image quality over baselines, suggesting practical value for scalable, coherent visual storytelling and potential extensions to video and 3D content.

Abstract

Open-ended story visualization is a challenging task that involves generating coherent image sequences from a given storyline. One of the main difficulties is maintaining character consistency while creating natural and contextually fitting scenes, an area where many existing methods struggle. In this paper, we propose an enhanced Transformer module that uses separate self attention and cross attention mechanisms, leveraging prior knowledge from pre-trained diffusion models to ensure logical scene creation. The isolated self attention mechanism improves character consistency by refining attention maps to reduce focus on irrelevant areas and highlight key features of the same character. Meanwhile, the isolated cross attention mechanism independently processes each character's features, avoiding feature fusion and further strengthening consistency. Notably, our method is training-free, allowing the continuous generation of new characters and storylines without re-tuning. Both qualitative and quantitative evaluations show that our approach outperforms current methods, demonstrating its effectiveness.
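
To make the idea of isolated self attention more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes that queries from the image being generated attend to their own tokens plus the tokens of a character reference image, that a boolean mask hides reference tokens outside the character, and that a scalar `ref_weight` re-weights the remaining reference tokens. The function name, the argument names, and the exact form of the re-weighting are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; exp(-inf) evaluates to 0 for masked entries.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def isolated_self_attention(q, k_self, v_self, k_ref, v_ref, ref_mask, ref_weight=1.0):
    """Hypothetical sketch of masked, re-weighted self attention.

    q, k_self, v_self: (N, d) tokens of the image being generated.
    k_ref, v_ref:      (R, d) tokens of the character reference image.
    ref_mask:          (R,) bool, True where a reference token belongs to the character.
    ref_weight:        positive scalar; values > 1 shift attention toward the reference.
    """
    d = q.shape[-1]
    n_self = k_self.shape[0]
    keys = np.concatenate([k_self, k_ref], axis=0)
    values = np.concatenate([v_self, v_ref], axis=0)
    logits = q @ keys.T / np.sqrt(d)

    # Additive bias: 0 for own tokens, log(ref_weight) for in-mask reference
    # tokens, -inf for reference tokens outside the character mask.
    bias = np.zeros(keys.shape[0])
    bias[n_self:] = np.where(ref_mask, np.log(ref_weight), -np.inf)

    attn = softmax(logits + bias, axis=-1)
    return attn @ values, attn
```

Adding `log(ref_weight)` to a logit is equivalent to multiplying the corresponding unnormalized attention weight by `ref_weight`, so the sketch keeps a single softmax while still suppressing irrelevant reference regions completely.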

Paper Structure

This paper contains 18 sections, 7 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Pipeline of our framework. Given a story, we utilize an LLM agent to decompose it into scene prompts and character prompts (a). These prompts are then fed into a pre-trained diffusion model to generate anime-style images (b). We replace the traditional upsampling Transformer block with an extended Transformer block, which introduces an extended branch (c). In this new branch, we design isolated self attention (e) and isolated cross attention (d) mechanisms, which extract cross attention maps from the original branch to enhance character consistency and reduce feature confusion.
  • Figure 2: Ablation study of our re-weight operation and visualization of the isolated self attention map. After re-weighting, the character's skin tone, hair color, and image style align more closely with the reference image. The attention map also shows increased focus on the reference tokens, with non-gray areas indicating regions masked by the operation described in the masking section of the paper.
  • Figure 3: Comparison of common cross attention with our isolated cross attention. Our method accurately isolates each character's features, preventing confusion between black and white clothing.
  • Figure 4: Qualitative comparison results. Each column of images should match the content of the prompt, and the appearance of characters within all the same-colored bounding boxes in each row should remain consistent. The results demonstrate that our method effectively maintains character consistency and accurately aligns with the prompt content.
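
The isolated cross attention summarized in the abstract and in the captions of Figures 1 and 3 can be illustrated with a small NumPy sketch. This is one plausible reading under stated assumptions, not the paper's formulation: each character has its own text embedding and a boolean spatial mask, cross attention is computed separately per character, and the per-character outputs are pasted into the scene-level output only inside the matching mask. For brevity the text embeddings serve as both keys and values, with no learned projections. All function and argument names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax over the text-token axis.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def cross_attention(q, text_emb):
    # Plain cross attention; text_emb acts as keys and values (simplification).
    d = q.shape[-1]
    attn = softmax(q @ text_emb.T / np.sqrt(d), axis=-1)
    return attn @ text_emb


def isolated_cross_attention(q, scene_emb, char_embs, char_masks):
    """Hypothetical sketch of per-character cross attention.

    q:          (N, d) image queries.
    scene_emb:  (T, d) embedding of the scene prompt.
    char_embs:  list of (T_k, d) embeddings, one per character prompt.
    char_masks: list of (N,) bool masks marking each character's image region.
    """
    # Start from the scene-level output, used wherever no character is present.
    out = cross_attention(q, scene_emb)
    for emb, mask in zip(char_embs, char_masks):
        # Each character's prompt is attended to on its own, so tokens of
        # other characters cannot mix into this character's features.
        char_out = cross_attention(q, emb)
        out = np.where(mask[:, None], char_out, out)
    return out
```

Because every character's output is computed without access to the other characters' prompt tokens, attributes such as clothing color cannot be exchanged through the text side of the attention, which matches the kind of confusion Figure 3 is meant to show.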