Training-Free Consistent Text-to-Image Generation

Yoad Tewel; Omri Kaduri; Rinon Gal; Yoni Kasten; Lior Wolf; Gal Chechik; Yuval Atzmon

TL;DR

ConsiStory keeps a subject's identity consistent across diverse prompts in text-to-image (T2I) generation without training or fine-tuning. It introduces Subject-Driven Self-Attention, an inference-time mechanism that shares activations across the images in a generated batch, and supplements it with cross-image feature injection and with diversification strategies that preserve layout variety. The approach reports state-of-the-art subject consistency and prompt alignment while avoiding the lengthy per-subject optimization that fine-tuning-based methods require, and it extends to multi-subject scenes and training-free personalization of common objects. It remains compatible with existing editing tools such as ControlNet.

Abstract

Text-to-image models offer a new level of creative flexibility by allowing users to guide the image generation process through natural language. However, using these models to consistently portray the same subject across diverse prompts remains challenging. Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects or add image conditioning to the model. These methods require lengthy per-subject optimization or large-scale pre-training. Moreover, they struggle to align generated images with text prompts and face difficulties in portraying multiple subjects. Here, we present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model. We introduce a subject-driven shared attention block and correspondence-based feature injection to promote subject consistency between images. Additionally, we develop strategies to encourage layout diversity while maintaining subject consistency. We compare ConsiStory to a range of baselines, and demonstrate state-of-the-art performance on subject consistency and text alignment, without requiring a single optimization step. Finally, ConsiStory can naturally extend to multi-subject scenarios, and even enable training-free personalization for common objects.

Paper Structure (41 sections, 8 equations, 26 figures)

Introduction
Related work
Consistent T2I generation
Attention-based Consistency.
Appearance transfer using dense correspondence maps
Preliminaries: Self-Attention in T2I models
Method
Subject-driven self-attention
Enriching layout diversity
Using Vanilla Query Features.
Self-Attention Dropout
Feature injection
Anchor images and reusable subjects
Multi-subject consistent generation
Experiments
...and 26 more sections

Figures (26)

Figure 1: Architecture outline (left): Given a set of prompts, at every generation step we localize the subject in each generated image $I_i$. We use the cross-attention maps up to the current generation step to create subject masks $M_i$. Then, we replace the standard self-attention layers in the U-Net decoder with Subject-Driven Self-Attention (SDSA) layers that share information between subject instances. We also add Feature Injection for additional refinement. Subject-Driven Self-Attention (right): We extend the self-attention layer so that the Query from generated image $I_i$ also has access to the Keys from all other images in the batch ($I_j$, where $j \neq i$), restricted by their subject masks $M_j$. To enrich diversity, we (1) weaken the SDSA via dropout and (2) blend Query features with vanilla Query features from a non-consistent sampling step, yielding $Q_1^*$.
Figure 2: Feature Injection: To further refine the subject's identity across images, we introduce a mechanism for blending features within the batch. We extract a patch correspondence map between each pair of images (Middle), and then inject features between images based on that map (Right).
Figure 3: Qualitative Results: We evaluated our method against IP-Adapter, TI, and DB-LoRA. Some methods failed to maintain consistency (TI) or to follow the prompt (IP-Adapter). Other methods alternated between keeping consistency and following the text, but did not achieve both (DB-LoRA). Our method successfully followed the prompt while maintaining consistency. Additional results are shown in Figure \ref{fig:fig_extra_qualitative_baselines}.
Figure 4: Multiple Subjects: ConsiStory generates multiple consistent subjects, while other methods often neglect at least one subject.
Figure 5: Seed Variation: Given different starting noise, ConsiStory generates a different consistent set of images.
...and 21 more figures
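
The Subject-Driven Self-Attention described in the Figure 1 caption can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name `subject_driven_self_attention`, the tensor layout `(N, P, d)` (images, patches, feature dimension), and the way dropout is applied (randomly removing entries from the subject masks) are all hypothetical choices made for the example. The blending with vanilla Query features is omitted.

```python
import numpy as np


def subject_driven_self_attention(q, k, v, subject_masks, dropout=0.0, rng=None):
    """Illustrative SDSA over a batch of N images with P patches each.

    q, k, v:        arrays of shape (N, P, d)
    subject_masks:  boolean array of shape (N, P); True marks subject patches
    dropout:        assumed probability of hiding a subject patch from other images
    Returns an array of shape (N, P, d).
    """
    n, p, d = q.shape
    masks = subject_masks.astype(bool)
    if dropout > 0.0:
        rng = np.random.default_rng() if rng is None else rng
        # Weaken sharing: randomly drop subject patches from the shared set.
        masks = masks & (rng.random(masks.shape) >= dropout)

    # Keys and values from every image in the batch, flattened to (N*P, d).
    k_all = k.reshape(n * p, d)
    v_all = v.reshape(n * p, d)

    out = np.empty_like(q)
    for i in range(n):
        # Image i sees all of its own patches, plus only the subject
        # patches (as given by the masks) of every other image.
        allowed = masks.copy()
        allowed[i] = True
        allowed = allowed.reshape(n * p)

        logits = q[i] @ k_all.T / np.sqrt(d)               # (P, N*P)
        logits = np.where(allowed[None, :], logits, -np.inf)
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        weights = np.exp(logits)                            # masked entries become 0
        weights = weights / weights.sum(axis=1, keepdims=True)
        out[i] = weights @ v_all
    return out
```

With all subject masks set to False, the function reduces to ordinary per-image self-attention, which is a convenient sanity check when experimenting with the masking logic.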
