Learning to Customize Text-to-Image Diffusion In Diverse Context

Taewook Kim, Wei Chen, Qiang Qiu

TL;DR

This work tackles concept overfitting in personalizing text-to-image diffusion models by diversifying context purely in the textual space. It introduces a Masked Language Modeling objective, used with a context-rich prompt set and a pretrained contextualizer, to regularize concept embeddings and preserve contextual semantics during customization. The method is architecture-agnostic and yields consistent improvements in text-prompt fidelity and, correspondingly, image prompt fidelity across four baseline T2I customization methods. Theoretical and empirical analyses demonstrate that semantic enhancement in the textual space translates to improved generation in the image space, making the approach broadly applicable and cost-effective. Overall, the approach offers a practical pathway to strengthen semantic alignment without altering model architectures or requiring large additional data.

Abstract

Most text-to-image customization techniques fine-tune models on a small set of personal concept images captured in minimal contexts. This often results in the model becoming overfitted to these training images and unable to generalize to new contexts in future text prompts. Existing customization methods are built on the success of effectively representing personal concepts as textual embeddings. Thus, in this work, we resort to diversifying the context of these personal concepts solely within the textual space by simply creating a contextually rich set of text prompts, together with a widely used self-supervised learning objective. Surprisingly, this straightforward and cost-effective method significantly improves semantic alignment in the textual space, and this effect further extends to the image space, resulting in higher prompt fidelity for generated images. Additionally, our approach does not require any architectural modifications, making it highly compatible with existing text-to-image customization methods. We demonstrate the broad applicability of our approach by combining it with four different baseline methods, achieving notable CLIP score improvements.

Paper Structure

This paper contains 34 sections, 3 theorems, 20 equations, 15 figures, 4 tables, 2 algorithms.

Key Result

Proposition 1

The model overfitting to the concept token makes the attention map mostly attend to the concept token, i.e., $A[i, j_*] \gg A[i, j], \forall j\neq j_*$, where $j_*$ is the index of the concept token. The distance between the context embeddings $c_i$ and the concept embedding $c_{i_*}$ is bounded,
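The premise of Proposition 1 can be illustrated numerically. The sketch below is a toy example in plain Python, not the paper's derivation or its bound: the value vectors, the logits, and the helper names (`attend`, `max_dist_to_concept`) are all assumptions made for illustration. It shows that when every attention row puts almost all of its weight on the concept index $j_*$, the attended outputs of all tokens collapse toward the concept token's output, so their distances to it become small.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def attend(logits_row, values):
    """Attention output for one query: softmax-weighted sum of value vectors."""
    weights = softmax(logits_row)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def l2(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy value vectors (assumption); index 1 plays the role of the concept token j_*.
values = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
j_star = 1

# Balanced case: each token mostly attends to itself.
balanced = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
# Overfit case: a large bonus on column j_* makes A[i, j_*] >> A[i, j].
overfit = [[x + (8.0 if j == j_star else 0.0) for j, x in enumerate(row)]
           for row in balanced]

def max_dist_to_concept(logits):
    """Largest distance between any token's output and the concept token's output."""
    outs = [attend(row, values) for row in logits]
    return max(l2(o, outs[j_star]) for o in outs)

d_balanced = max_dist_to_concept(balanced)
d_overfit = max_dist_to_concept(overfit)
print(f"balanced: {d_balanced:.4f}  overfit: {d_overfit:.4f}")
```

In this toy setting the overfit distances are far smaller than the balanced ones, which matches the intuition that context tokens lose their distinct semantics once attention concentrates on the concept token.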

Figures (15)

  • Figure 1: Comparison across various text-to-image models before and after integrating our method. The proposed approach consistently enhances prompt fidelity in generation results.
  • Figure 2: Conceptual illustration of the proposed approach. Left: We propose to diversify the context of the personal concept solely within the textual space, by simply constructing a context-rich text prompt set with a concept token. Right: In our method, the concept token embeddings are effectively guided to learn the relationship between the surrounding tokens in diverse contexts. This leads to the semantic enhancement of text representation by preserving the contextual information, which ultimately leads to higher text prompt fidelity in image generation. The proposed method is demonstrated both theoretically and empirically in the paper.
  • Figure 3: Illustration of the proposed text-to-image customization process. The MLM loss $\mathcal{L}_\text{MLM}$ is computed, along with the denoising loss $\mathcal{L}_\text{Diff}$, to align the special concept image with the concept token. For MLM, we sample text prompts from a contextually diverse prompt set. The sampled prompt is then tokenized and mapped to a prompt embedding $\textbf{P}$. Subsequently, a subset of the input tokens is masked to yield $\textbf{P}_\text{masked}$, which is fed into the CLIP text encoder to output $\textbf{C}_\text{masked}$. Then, the masked embedding is contextualized with the surrounding tokens, including the concept token and the context tokens, by self-attention layers. After that, the masked token is predicted. As the concept token is trained with $\mathcal{L}_\text{MLM}$ to predict the most semantically aligned token, the concept token embedding effectively learns its context. For computing $\mathcal{L}_\text{Diff}$, we use the context-simple caption, the same as the baseline. Textual Inversion (TI) is used as an example baseline here.
  • Figure 4: Illustrative comparison between the baseline approach and ours. Left: The baseline approach is prone to losing the semantics of the contexts, as the concept token embedding only learns to associate the tokens within limited contexts that correspond to the same concept image. As a result, the semantics of the distinct subject tokens become similar, leading to concept overfitting. Right: In contrast, MLM regularizes the loss of contextual semantics, as their elimination leads to ineffective mask predictions. Also, by deploying MLM with diverse contexts, the concept token embedding learns to both associate and disassociate the context tokens. By learning to disassociate the distinct subject, the contextual semantics are preserved.
  • Figure 5: Visualization of $16\times16$ attention maps from cross-attention layers. Top: Baseline. Bottom: Our approach. It yields cross-attention maps for the concept token and the context token that are more distinctly distributed, leading to semantically enhanced image generation.
  • ...and 10 more figures
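
The masking and loss steps described in the Figure 3 caption can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the prompt set, the `<new1>` and `[MASK]` symbols, the masking probability, the choice to never mask the concept token, and the weight `lam` are all hypothetical. The CLIP text encoder and the contextualizer are not modeled; their output is stood in for by per-position logits supplied by the caller.

```python
import math
import random

CONCEPT = "<new1>"  # hypothetical placeholder for the concept token
MASK = "[MASK]"     # hypothetical mask symbol

# Hypothetical context-rich prompt set; the paper's actual prompts are not reproduced.
PROMPTS = [
    f"a photo of {CONCEPT} on a sandy beach at sunset",
    f"{CONCEPT} sitting next to a red bicycle in the rain",
    f"an oil painting of {CONCEPT} inside a crowded library",
]

def mask_prompt(prompt, mask_prob, rng):
    """Mask a random subset of context tokens; the concept token stays visible (assumption)."""
    masked, labels = [], []
    for tok in prompt.split():
        if tok != CONCEPT and rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)   # token to be predicted at this position
        else:
            masked.append(tok)
            labels.append(None)  # position ignored by the MLM loss
    return masked, labels

def cross_entropy(logits, target):
    """Negative log-likelihood of index `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def mlm_loss(per_position_logits, labels, vocab):
    """Average cross-entropy over masked positions only."""
    losses = [cross_entropy(lg, vocab[lab])
              for lg, lab in zip(per_position_logits, labels) if lab is not None]
    return sum(losses) / len(losses) if losses else 0.0

def total_loss(l_diff, l_mlm, lam=1.0):
    """Denoising loss plus weighted MLM loss; `lam` is a hypothetical weight."""
    return l_diff + lam * l_mlm

# Usage: uniform logits stand in for the encoder + contextualizer output.
rng = random.Random(0)
vocab = {t: i for i, t in enumerate(sorted({t for p in PROMPTS for t in p.split()}))}
prompt = rng.choice(PROMPTS)
masked, labels = mask_prompt(prompt, mask_prob=0.5, rng=rng)
logits = [[0.0] * len(vocab) for _ in masked]
l_mlm = mlm_loss(logits, labels, vocab)
loss = total_loss(l_diff=0.25, l_mlm=l_mlm, lam=0.5)
print(masked, round(loss, 4))
```

Because the concept token is always left unmasked in this sketch, every masked prediction has to be made with the concept token in its context, which is the mechanism the caption credits for teaching the concept embedding its surroundings.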

Theorems & Definitions (9)

  • Proposition 1
  • Proposition 2
  • Remark 3
  • Proposition 4
  • Remark 5
  • Remark 6
  • proof
  • proof
  • proof