DECOR: Decomposition and Projection of Text Embeddings for Text-to-Image Customization

Geonhui Jang, Jin-Hwa Kim, Yong-Hyun Park, Junho Kim, Gayoung Lee, Yonghyun Jeong

TL;DR

DECOR tackles overfitting in LoRA-based text-to-image customization by revealing that undesired semantics are entangled within word token embeddings. Through singular value decomposition of CLIP text embeddings, the method identifies dominant axes linked to PAD tokens and introduces a training-free projection strategy that suppresses these axes during inference using $X' = X - \alpha X P_{\tilde{X}}$ with $P_{\tilde{X}} = \tilde{V} \tilde{V}^T$. Empirical results across personalization, stylization, and content-style mixing demonstrate state-of-the-art performance and a favorable Pareto frontier between text alignment and visual fidelity, achieved without additional training. The work provides interpretability of embedding-space geometry and offers a practical, plug-in refinement that can complement other customization techniques.
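The projection $X' = X - \alpha X P_{\tilde{X}}$ can be illustrated numerically. The following is a minimal NumPy sketch, not the authors' implementation: the function name `suppress_axes`, the choice of building $\tilde{V}$ from the top-`k` right singular vectors of a reference matrix `X_ref`, and the toy matrix sizes are all assumptions made for illustration.

```python
import numpy as np

def suppress_axes(X, X_ref, k=1, alpha=1.0):
    """Return X' = X - alpha * X @ P, where P = V_k @ V_k.T.

    V_k holds the top-k right singular vectors of X_ref
    (a hypothetical way to pick the axes to suppress).
    """
    # Rows of Vt are right singular vectors, ordered by descending singular value.
    _, _, Vt = np.linalg.svd(X_ref, full_matrices=False)
    V_k = Vt[:k].T            # shape (d, k), orthonormal columns
    P = V_k @ V_k.T           # orthogonal projector onto span(V_k), shape (d, d)
    return X - alpha * (X @ P)

# Toy data: 77 token rows, 16-dimensional embeddings (sizes are illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(77, 16))
X_ref = rng.normal(size=(5, 16))

X_proj = suppress_axes(X, X_ref, k=2, alpha=1.0)
```

With `alpha=1.0` the rows of `X_proj` have no component along the selected axes; smaller values of `alpha` remove only a fraction of that component, and `alpha=0.0` leaves `X` unchanged.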

Abstract

Text-to-image (T2I) models can effectively capture the content or style of reference images to perform high-quality customization. A representative technique for this is fine-tuning using low-rank adaptations (LoRA), which enables efficient model customization with reference images. However, fine-tuning with a limited number of reference images often leads to overfitting, resulting in issues such as prompt misalignment or content leakage. These issues prevent the model from accurately following the input prompt or cause it to generate undesired objects during inference. To address this problem, we examine the text embeddings that guide the diffusion model during inference. This study decomposes the text embedding matrix and conducts a component analysis to understand the embedding space geometry and identify the cause of overfitting. Based on this, we propose DECOR, which projects text embeddings onto a vector space orthogonal to undesired token vectors, thereby reducing the influence of unwanted semantics in the text embeddings. Experimental results demonstrate that DECOR outperforms state-of-the-art customization models and achieves Pareto frontier performance across text and visual alignment evaluation metrics. Furthermore, it generates images more faithful to the input prompts, showcasing its effectiveness in addressing overfitting and enhancing text-to-image customization.

Paper Structure

This paper contains 21 sections, 3 equations, 26 figures, 1 table.

Figures (26)

  • Figure 1: (a) The CLIP text embeddings have a large first singular value due to the high similarity of the [PAD] tokens. (b) The pattern of embedding reconstruction differs according to the magnitude of the singular values.
  • Figure 2: Customization results with the original embeddings (baseline) and the embeddings reconstructed using selected components (others).
  • Figure 3: When the components along the axis of the unwanted word are subtracted from the original prompt embedding, this adjustment is reflected in the image generation results. From left to right, $\alpha$ is 0.5, 0.75, and 1.0.
  • Figure 4: Comparison of the original inference pipeline and ours. (a) In the standard approach, text embeddings are input into both the base and LoRA weights. (b) In the proposed method, we project the embedding onto the word embedding space and separate it from the original embedding. These manipulated embeddings are input into the LoRA layers, improving generation fidelity.
  • Figure 5: Simply removing the word embedding $X_{\text{w}}$ from the original embedding or using an embedding reconstructed without the subsequent components cannot solve the overfitting problem.
  • ...and 21 more figures
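
The observation in Figure 1(a), that many highly similar [PAD] rows produce a large first singular value, can be reproduced on synthetic data. The sketch below is an assumption-laden illustration and does not use real CLIP embeddings: the counts `n_words` and `n_pad`, the dimension `d`, and the noise scale are hypothetical values chosen only to mimic a short prompt followed by many near-identical padding rows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_pad, d = 5, 72, 16   # hypothetical sizes, not taken from the paper

# A few distinct "word" rows, each normalized to unit length.
words = rng.normal(size=(n_words, d))
words /= np.linalg.norm(words, axis=1, keepdims=True)

# Many "PAD" rows: one shared unit vector plus small noise, so they are nearly identical.
base = rng.normal(size=d)
base /= np.linalg.norm(base)
pads = base + 0.01 * rng.normal(size=(n_pad, d))

X = np.vstack([words, pads])                  # shape (n_words + n_pad, d)
s = np.linalg.svd(X, compute_uv=False)        # singular values, descending

# Share of the total squared norm captured by the first singular direction.
first_share = s[0] ** 2 / np.sum(s ** 2)
print("first two singular values:", s[:2])
print("energy share of first axis:", first_share)
```

Because the padding rows all point in almost the same direction, their contributions add up along a single axis, which is why the first singular value dwarfs the rest in this toy setting.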