Uncovering the Text Embedding in Text-to-Image Diffusion Models

Hu Yu, Hao Luo, Fan Wang, Feng Zhao

TL;DR

The paper investigates the text embedding space in Stable Diffusion and shows that per-word embeddings and their contextual correlations govern image generation, enabling learning-free controllable edits. It identifies two key insights via a mask-then-generate analysis: (i) the causal context of word embeddings and (ii) the relative dominance of semantic versus padding embeddings, which enables content/style disentanglement. It then demonstrates practical editing operations (single-word swaps, weight scaling, and semantic/padding swaps) and an optional optimization-based mixing framework with a learnable soft mixing weight $\boldsymbol{\lambda}$, plus an extension to real-image editing through inversion. Finally, it reveals that text embeddings possess diverse semantic potential, uncovered via SVD, with the left and right singular vectors $\mathbf{u}$ and $\mathbf{v}$ defining interpretable semantic directions that support semantic discovery and editing.
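
The editing operations listed above can be illustrated on a stand-in embedding matrix. The following is a minimal sketch, not the authors' implementation: it assumes a text embedding of shape 77 x 768 (the shape produced by the CLIP text encoder in Stable Diffusion v1), uses random arrays in place of real embeddings, and treats masking as zeroing rows, which is an assumption because this summary does not specify the masking operation. All function names are hypothetical.

```python
import numpy as np

SEQ_LEN, DIM = 77, 768  # assumed shape of a Stable Diffusion v1 text embedding


def mask_words(emb, start, end):
    """Mask rows start..end (inclusive). Zeroing is an assumed masking choice."""
    out = emb.copy()
    out[start:end + 1] = 0.0
    return out


def swap_word(emb_src, emb_tgt, idx):
    """Replace one word embedding of emb_src with the row at idx from emb_tgt."""
    out = emb_src.copy()
    out[idx] = emb_tgt[idx]
    return out


def scale_word(emb, idx, weight):
    """Re-scale a single word embedding by a scalar weight."""
    out = emb.copy()
    out[idx] *= weight
    return out


def swap_padding(emb_a, emb_b, n_semantic):
    """Keep the first n_semantic rows of emb_a; take the padding rows from emb_b."""
    out = emb_a.copy()
    out[n_semantic:] = emb_b[n_semantic:]
    return out


# Random stand-ins for the embeddings of two prompts (not real CLIP outputs).
rng = np.random.default_rng(0)
emb_a = rng.standard_normal((SEQ_LEN, DIM))
emb_b = rng.standard_normal((SEQ_LEN, DIM))

masked = mask_words(emb_a, 2, 4)       # mask the 2nd to 4th positions
swapped = swap_word(emb_a, emb_b, 4)   # single-word swap at position 4
scaled = scale_word(emb_a, 3, 1.5)     # emphasize the word at position 3
mixed = swap_padding(emb_a, emb_b, 6)  # semantic rows from a, padding rows from b
```

In an actual pipeline, each edited matrix would be passed to the diffusion model in place of the original text embedding; that step is omitted here because it depends on the specific model interface.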

Abstract

The correspondence between input text and the generated image exhibits opacity, wherein minor textual modifications can induce substantial deviations in the generated image. Meanwhile, text embedding, as the pivotal intermediary between text and images, remains relatively underexplored. In this paper, we address this research gap by delving into the text embedding space, unleashing its capacity for controllable image editing and explicable semantic direction attributes within a learning-free framework. Specifically, we identify two critical insights regarding the importance of per-word embedding and their contextual correlations within text embedding, providing instructive principles for learning-free image editing. Additionally, we find that text embedding inherently possesses diverse semantic potentials, and further reveal this property through the lens of singular value decomposition (SVD). These uncovered properties offer practical utility for image editing and semantic discovery. More importantly, we expect that the in-depth analyses and findings on text embedding can enhance the understanding of text-to-image diffusion models.

Paper Structure (20 sections, 7 equations, 12 figures, 1 algorithm)

This paper contains 20 sections, 7 equations, 12 figures, 1 algorithm.

Figures (12)

  • Figure 1: The flow chart of the text encoder in CLIP. We take the text "A photo of dog" as example. The given text prompt passes through the tokenizer, embedding lookup, and text transformer to get the corresponding text embedding.
  • Figure 2: Examples of the mask-then-generate strategy. In this strategy, we mask certain word embeddings and compare the resulting images. For example, $M_{i\text{-}j}$ denotes the generated image with the $i$-th to $j$-th word embeddings masked.
  • Figure 3: Manipulation in text space and text embedding space. (a) Replacing words in the text space leads to random and uncontrollable image content: a slight change to the text induces a significant and uncontrollable deviation in the generated image. (b) Controllable object replacement is achievable by substituting key word embeddings in the text embedding space. (c) Re-scaling the weight of the descriptive word embedding leads to continuous fader control. (d) Style transfer is possible via disentangling the content and style in text embedding.
  • Figure 4: Optional optimization framework. We freeze the parameters of the diffusion model and only learn the soft mixing weight.
  • Figure 5: The SVD of the text embedding matrix. The right singular vector at the column of $V^T$ and the left singular vector at the row of $U$ are desirable semantic directions, with different singular vectors representing different semantic directions.
  • ...and 7 more figures
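
The decomposition described for Figure 5 can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the paper's procedure: the embedding is a random 77 x 768 stand-in, the step size `alpha` is hypothetical, and the final shift along a singular vector is only one plausible way to apply a direction. Note that `np.linalg.svd` returns the left singular vectors as the columns of `U` and the right singular vectors as the rows of `Vt`; the row/column wording in the caption may follow a different layout of the matrix.

```python
import numpy as np

# Random stand-in for a text embedding matrix (tokens x features); not a real CLIP output.
rng = np.random.default_rng(0)
E = rng.standard_normal((77, 768))

# Thin SVD: E = U @ diag(S) @ Vt, with U (77, 77), S (77,), Vt (77, 768).
U, S, Vt = np.linalg.svd(E, full_matrices=False)

# Sanity check: the factors reconstruct the original matrix.
E_rec = (U * S) @ Vt

k = 0          # index of the singular direction to inspect
u = U[:, k]    # k-th left singular vector: one coefficient per token position (77,)
v = Vt[k]      # k-th right singular vector: a direction in feature space (768,)

# Hypothetical edit: shift every token embedding along the feature-space direction v.
alpha = 2.0    # hypothetical step size
E_shifted = E + alpha * v[None, :]
```

Different values of `k` select different singular vectors, which the caption describes as corresponding to different semantic directions; which directions are meaningful for a real prompt would have to be checked by generating images.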