Infinite-Story: A Training-Free Consistent Text-to-Image Generation

Jihun Park, Kyoungmin Lee, Jongmin Gim, Hyeonseo Jo, Minseok Oh, Wonhyeok Choi, Kyumin Hwang, Jaeyeul Kim, Minwoo Choi, Sunghoon Im

TL;DR

Infinite-Story introduces a training-free, scale-wise autoregressive framework for consistent text-to-image generation across multiple prompts in visual storytelling. It achieves cross-image identity and style coherence through Identity Prompt Replacement and a Unified Attention Guidance mechanism comprising Adaptive Style Injection and Synchronized Guidance Adaptation, all at test time without fine-tuning. Experiments on ConsiStory+ show a state-of-the-art harmonic score with strong identity and style metrics at about 1.72 seconds per image, over 6× faster than diffusion-based consistency methods. The approach enables practical, real-time multi-prompt narratives and lays groundwork for future temporal consistency and adaptive anchor strategies in video-like generation.

Abstract

We present Infinite-Story, a training-free framework for consistent text-to-image (T2I) generation tailored for multi-prompt storytelling scenarios. Built upon a scale-wise autoregressive model, our method addresses two key challenges in consistent T2I generation: identity inconsistency and style inconsistency. To overcome these issues, we introduce three complementary techniques: Identity Prompt Replacement, which mitigates context bias in text encoders to align identity attributes across prompts; and a unified attention guidance mechanism comprising Adaptive Style Injection and Synchronized Guidance Adaptation, which jointly enforce global style and identity appearance consistency while preserving prompt fidelity. Unlike prior diffusion-based approaches that require fine-tuning or suffer from slow inference, Infinite-Story operates entirely at test time, delivering high identity and style consistency across diverse prompts. Extensive experiments demonstrate that our method achieves state-of-the-art generation performance, while offering over 6× faster inference (1.72 seconds per image) than the existing fastest consistent T2I models, highlighting its effectiveness and practicality for real-world visual storytelling.

Paper Structure

This paper contains 34 sections, 5 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Results of Infinite-Story. Given text prompts with diverse expressions sharing identity prompts, our method generates image sequences with consistent subject identity and style.
  • Figure 2: Qualitative comparison with 1Prompt1Story [liu2025onepromptonestory]. While 1Prompt1Story maintains a consistent identity across the sequence, it fails to preserve visual style consistency, resulting in noticeable differences in rendering, background, and color tone across images. In contrast, our Infinite-Story achieves both identity and style consistency, producing a coherent visual narrative with uniform illustration style and subject appearance throughout the generated images.
  • Figure 3: Comparison of inference time and harmonic score $S_H$ between our method and state-of-the-art identity-consistent text-to-image generation models.
  • Figure 4: Overall pipeline of our method. The text encoder $E_T$ [chung2022scalinginstructionfinetunedlanguagemodels] processes a set of text prompts $\mathbf{t}$, producing contextual embeddings $\mathbf{T}$ that condition the transformer. Identity Prompt Replacement is applied to $\mathbf{T}$ before generation to ensure consistent identity representation across prompts. During generation, Unified Attention Guidance (UAG), which consists of Adaptive Style Injection and Synchronized Guidance Adaptation, is applied to early-stage self-attention layers to achieve consistent identity appearance and overall style alignment while preserving prompt fidelity. The transformer autoregressively produces residual feature maps, which are decoded into final images $\mathbf{I}$ via the image decoder.
  • Figure 5: Context-bias in text-to-image generation.
  • ...and 8 more figures