Infinite-Story: A Training-Free Consistent Text-to-Image Generation

Jihun Park; Kyoungmin Lee; Jongmin Gim; Hyeonseo Jo; Minseok Oh; Wonhyeok Choi; Kyumin Hwang; Jaeyeul Kim; Minwoo Choi; Sunghoon Im

TL;DR

Infinite-Story is a training-free framework, built on a scale-wise autoregressive model, for consistent text-to-image generation across multiple prompts in visual storytelling. It achieves cross-image identity and style coherence through Identity Prompt Replacement and a Unified Attention Guidance mechanism comprising Adaptive Style Injection and Synchronized Guidance Adaptation, all applied at test time without fine-tuning. Experiments on ConsiStory+ show state-of-the-art harmonic consistency with strong identity and style metrics, at about 1.72 seconds per image, over 6× faster than the fastest existing consistent T2I models. The approach enables practical, real-time multi-prompt narratives and lays groundwork for future work on temporal consistency and adaptive anchor strategies in video-like generation.

Abstract

We present Infinite-Story, a training-free framework for consistent text-to-image (T2I) generation tailored for multi-prompt storytelling scenarios. Built upon a scale-wise autoregressive model, our method addresses two key challenges in consistent T2I generation: identity inconsistency and style inconsistency. To overcome these issues, we introduce three complementary techniques: Identity Prompt Replacement, which mitigates context bias in text encoders to align identity attributes across prompts, and a unified attention guidance mechanism comprising Adaptive Style Injection and Synchronized Guidance Adaptation, which jointly enforce global style and identity appearance consistency while preserving prompt fidelity. Unlike prior diffusion-based approaches that require fine-tuning or suffer from slow inference, Infinite-Story operates entirely at test time, delivering high identity and style consistency across diverse prompts. Extensive experiments demonstrate that our method achieves state-of-the-art generation performance while offering over 6× faster inference (1.72 seconds per image) than the fastest existing consistent T2I models, highlighting its effectiveness and practicality for real-world visual storytelling.

