Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation using A Single Prompt

Shangxun Li; Youngjung Uh

TL;DR

The paper addresses the loss of subject identity across frames in single-prompt text-to-image diffusion, which it attributes to semantic entanglement in the concatenated prompt. It introduces a training-free, dual-subspace orthogonal projection of text embeddings that purifies the semantics to be suppressed while preserving the semantics to be expressed: $S' = S - \frac{S\cdot E}{\|E\|^2} E$ removes from $S$ its component along $E$, $X' = X - \alpha S'$ then subtracts the scaled result from $X$, and the projection spaces are constructed from an SVD as $P_{\tilde{X}} = \tilde{V}\tilde{V}^{T}$. On the ConsiStory+ benchmark, the method reports state-of-the-art subject consistency and strong text alignment, and ablations confirm the importance of protecting the express semantics during suppression. The result is a simple, training-free, geometry-based refinement of text embeddings aimed at visual storytelling, where the subject must be rendered faithfully across frames.
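The formulas in the summary can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes, for illustration only, that $S$, $E$, and $X$ are single vectors of the same dimension, that $\tilde{V}$ holds the leading right singular vectors of a matrix of embeddings, and that the dimension, rank, and $\alpha$ shown here are arbitrary. The helper names `remove_component` and `subspace_projector` are hypothetical, and how the paper applies $P_{\tilde{X}}$ is not stated in this summary, so the sketch only constructs it.

```python
import numpy as np


def remove_component(s, e):
    """Return S' = S - (S.E / ||E||^2) E, the part of s orthogonal to e."""
    return s - (s @ e) / (e @ e) * e


def subspace_projector(x_tilde, rank):
    """Return P = V V^T, with V the top `rank` right singular vectors.

    Assumption: rows of x_tilde are embedding vectors of dimension d.
    """
    _, _, vt = np.linalg.svd(x_tilde, full_matrices=False)
    v = vt[:rank].T  # shape (d, rank), orthonormal columns
    return v @ v.T


# Illustrative random data; sizes and alpha are assumptions, not paper values.
rng = np.random.default_rng(0)
d = 8
S = rng.normal(size=d)
E = rng.normal(size=d)
X = rng.normal(size=d)
alpha = 0.5

S_prime = remove_component(S, E)   # S with its E-direction removed
X_prime = X - alpha * S_prime      # X' = X - alpha * S'

P = subspace_projector(rng.normal(size=(4, d)), rank=2)
```

Because `S_prime` is orthogonal to `E`, subtracting it from `X` leaves the component of `X` along `E` unchanged, which is one way to read "preserving express content" geometrically. The matrix `P` is symmetric and idempotent, as expected of an orthogonal projector.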

Abstract

Text-to-image diffusion models excel at generating high-quality images from natural language descriptions but often fail to preserve subject consistency across multiple outputs, limiting their use in visual storytelling. Existing approaches rely on model fine-tuning or image conditioning, which are computationally expensive and require per-subject optimization. 1Prompt1Story, a training-free approach, concatenates all scene descriptions into a single prompt and rescales token embeddings, but it suffers from semantic leakage, where embeddings across frames become entangled, causing text misalignment. In this paper, we propose a simple yet effective training-free approach that addresses semantic entanglement from a geometric perspective by refining text embeddings to suppress unwanted semantics. Extensive experiments demonstrate that our approach significantly improves both subject consistency and text alignment over existing baselines.
