Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation using A Single Prompt
Shangxun Li, Youngjung Uh
TL;DR
The paper tackles the problem of preserving subject identity across frames in single-prompt text-to-image diffusion, where concatenating all frame descriptions into one prompt causes semantic entanglement. It introduces a training-free, dual-subspace orthogonal projection of text embeddings that removes the semantics to be suppressed while protecting the semantics to be expressed. The suppress embedding $S$ is first purified by removing its component along the express embedding $E$, giving $S' = S - \frac{S\cdot E}{\|E\|^2} E$; the purified direction is then subtracted from the embedding $X$ with strength $\alpha$, giving $X' = X - \alpha S'$. The projection spaces are constructed from an SVD as $P_{\tilde{X}} = \tilde{V}\tilde{V}^{T}$. On the ConsiStory+ benchmark, the method reports state-of-the-art subject consistency together with strong text alignment, and ablations confirm the importance of protecting express semantics during suppression. Overall, the approach offers a simple, training-free, geometry-based refinement of text embeddings that supports visual storytelling by rendering the subject faithfully across frames.
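The formulas in the TL;DR can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes $S$, $E$, and $X$ are single 1-D embedding vectors, that the matrix passed to the SVD stacks token embeddings as rows, and that `rank` and `alpha` are free parameters. The function names `purify_suppress`, `refine`, and `subspace_projector` are hypothetical and chosen only for illustration.

```python
import numpy as np


def purify_suppress(S, E):
    """Remove from S its component along E: S' = S - (S.E / ||E||^2) E.

    Assumes S and E are 1-D vectors and E is non-zero.
    """
    return S - (S @ E) / (E @ E) * E


def refine(X, S_pure, alpha):
    """Subtract the purified suppress direction: X' = X - alpha * S'."""
    return X - alpha * S_pure


def subspace_projector(M, rank):
    """Build P = V V^T from the top-`rank` right singular vectors of M.

    Assumption: rows of M are token embeddings; which embeddings the
    paper stacks into this matrix is not stated in the summary above.
    """
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    V = Vt[:rank].T          # columns span the retained subspace
    return V @ V.T


# Toy vectors (illustrative values only).
S = np.array([1.0, 1.0, 0.0])   # direction to suppress
E = np.array([1.0, 0.0, 0.0])   # direction to keep expressing
X = np.array([2.0, 3.0, 4.0])   # embedding being refined

S_pure = purify_suppress(S, E)          # [0., 1., 0.], orthogonal to E
X_new = refine(X, S_pure, alpha=0.5)    # [2., 2.5, 4.]
```

Because $S'$ is orthogonal to $E$ by construction, subtracting any multiple of it leaves the component of $X$ along $E$ unchanged (here `X_new @ E == X @ E`), which is the sense in which the express content is protected during suppression.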
Abstract
Text-to-image diffusion models excel at generating high-quality images from natural language descriptions but often fail to preserve subject consistency across multiple outputs, limiting their use in visual storytelling. Existing approaches rely on model fine-tuning or image conditioning, which are computationally expensive and require per-subject optimization. 1Prompt1Story, a training-free approach, concatenates all scene descriptions into a single prompt and rescales token embeddings, but it suffers from semantic leakage: embeddings belonging to different frames become entangled, which causes text misalignment. In this paper, we propose a simple yet effective training-free approach that addresses semantic entanglement from a geometric perspective by refining text embeddings to suppress unwanted semantics. Extensive experiments show that our approach significantly improves both subject consistency and text alignment over existing baselines.
