Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation using A Single Prompt

Shangxun Li, Youngjung Uh

TL;DR

The paper tackles the problem of preserving subject identity across frames in single-prompt text-to-image diffusion by addressing semantic entanglement in concatenated prompts. It introduces a training-free dual-subspace orthogonal projection of text embeddings, purifying suppressive semantics while preserving express content, with $S' = S - \frac{S\cdot E}{\|E\|^2} E$ and $X' = X - \alpha S'$, using $P_{\tilde{X}} = \tilde{V}\tilde{V}^{T}$ from SVD to construct projection spaces. Empirical results on the ConsiStory+ benchmark show state-of-the-art subject consistency metrics and strong text alignment, with ablations confirming the importance of protecting express semantics during suppression. Overall, the approach offers a simple, training-free, geometry-based refinement that enhances visual storytelling applications by ensuring faithful subject rendering across frames.
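The two vector formulas and the SVD-based projector quoted above can be tried out numerically. The following is a minimal NumPy sketch, not the authors' implementation. It assumes that $S$ is an embedding to suppress, $E$ an embedding to express, $X$ the embedding being refined, and that $\tilde{V}$ holds orthonormal right singular vectors; the summary does not define these symbols, so these readings are assumptions. The function names, the random stand-in vectors, and the value of `alpha` are all hypothetical.

```python
import numpy as np

def remove_component(s, e):
    """Return s minus its orthogonal projection onto e: s - (s.e / ||e||^2) e."""
    return s - (s @ e) / (e @ e) * e

def subspace_projector(rows, rank):
    """Build P = V V^T from the top-`rank` right singular vectors of `rows`.

    `rows` is a (num_tokens, dim) matrix; treating each row as a token
    embedding is an assumption made for this sketch.
    """
    _, _, vt = np.linalg.svd(rows, full_matrices=False)
    V = vt[:rank].T              # (dim, rank), orthonormal columns
    return V @ V.T

rng = np.random.default_rng(0)
dim = 8
E = rng.normal(size=dim)         # stand-in "express" embedding (assumed role)
S = rng.normal(size=dim)         # stand-in "suppress" embedding (assumed role)
X = rng.normal(size=dim)         # stand-in embedding to refine (assumed role)
alpha = 0.5                      # hypothetical suppression strength

S_prime = remove_component(S, E) # S' = S - (S.E / ||E||^2) E
X_prime = X - alpha * S_prime    # X' = X - alpha S'

# Projector onto the span of a few stand-in embeddings, P = V V^T.
express_tokens = rng.normal(size=(3, dim))
P = subspace_projector(express_tokens, rank=3)
```

Because $S'$ is orthogonal to $E$ by construction, subtracting $\alpha S'$ leaves the component of $X$ along $E$ unchanged, which is one way to read "preserving express content". How the paper combines the projector with the two vector formulas is not stated in this summary, so the sketch keeps them separate.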

Abstract

Text-to-image diffusion models excel at generating high-quality images from natural language descriptions but often fail to preserve subject consistency across multiple outputs, limiting their use in visual storytelling. Existing approaches rely on model fine-tuning or image conditioning, which are computationally expensive and require per-subject optimization. 1Prompt1Story, a training-free approach, concatenates all scene descriptions into a single prompt and rescales token embeddings, but it suffers from semantic leakage, where embeddings across frames become entangled, causing text misalignment. In this paper, we propose a simple yet effective training-free approach that addresses semantic entanglement from a geometric perspective by refining text embeddings to suppress unwanted semantics. Extensive experiments show that our approach significantly improves both subject consistency and text alignment over existing baselines.

Paper Structure

This paper contains 10 sections, 1 equation, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview. 1Prompt1Story exhibits severe text misalignment due to semantic leakage (marked in red), where concepts from preceding frames bleed into subsequent ones. For instance, the “baby gorilla” from the first frame reappears incorrectly in the following two frames, and the “raincoat” from the first dog image persists in the second. Furthermore, it suffers from noticeable subject inconsistency (marked in green), where the dog's breed changes entirely in the third frame. In contrast, our method demonstrates superior performance, maintaining strong subject consistency and precise text alignment across all generated images.
  • Figure 2: An Illustration of Semantic Entanglement in Text Embeddings. As shown on the left, the causal self-attention mechanism in the text encoder causes semantic information to flow from earlier parts of the prompt (e.g., $P_1$: “dressed in a raincoat”) to later parts (e.g., $P_2$: “in a city alley”). The angles on the right quantify this entanglement, revealing a high cosine similarity (0.5736) between the embeddings $x^{P_1}$ and $x^{P_2}$.
  • Figure 3: An Illustration of Our Approach Compared to 1Prompt1Story. The diagram on the left shows how a concatenated prompt is processed by a text encoder, where the causal self-attention mechanism induces semantic entanglement across frame prompt embeddings. The colored bars visualize these embeddings, with each dimension representing the semantic strength of concepts from $[P_0; P_1; P_2; P_3]$. 1Prompt1Story reweights embeddings by scaling up the current frame and downscaling others, but it suffers from semantic leakage, where entangled embeddings cause severe text misalignment, and subject inconsistency, where aggressive downscaling causes loss of identity information. In contrast, our method separates undesired semantics from the text embedding while preserving essential information for rendering the target subject.
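
The cosine similarity quoted in the Figure 2 caption, and the contrast between rescaling and separation drawn in the Figure 3 caption, can be illustrated with a toy NumPy example. This is a hedged sketch, not the paper's procedure: the 2-D vectors `x_p1` and `x_p2` are hypothetical stand-ins for $x^{P_1}$ and $x^{P_2}$, placed 55 degrees apart so that their cosine similarity is about 0.5736, and the scale factor is arbitrary.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def angle_degrees(a, b):
    """Angle between a and b in degrees (clipped for numerical safety)."""
    return float(np.degrees(np.arccos(np.clip(cosine_similarity(a, b), -1.0, 1.0))))

# Hypothetical 2-D stand-ins for x^{P_1} and x^{P_2}, 55 degrees apart.
theta = np.radians(55.0)
x_p1 = np.array([1.0, 0.0])
x_p2 = np.array([np.cos(theta), np.sin(theta)])

sim = cosine_similarity(x_p1, x_p2)   # approximately 0.5736
ang = angle_degrees(x_p1, x_p2)       # approximately 55 degrees

# Rescaling an embedding by a positive factor leaves the cosine unchanged.
sim_scaled = cosine_similarity(x_p1, 3.0 * x_p2)

# Removing the component of x_p2 along x_p1 drives the cosine to zero.
x_p2_perp = x_p2 - (x_p2 @ x_p1) / (x_p1 @ x_p1) * x_p1
sim_projected = cosine_similarity(x_p1, x_p2_perp)
```

In this toy setting, positive rescaling changes an embedding's magnitude but not its direction, so the overlap measured by cosine similarity stays the same, whereas an orthogonal projection removes it. Whether this fully accounts for the behavior described in the captions is beyond what the example shows.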