Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion

Inhwa Han, Serin Yang, Taesung Kwon, Jong Chul Ye

TL;DR

This work addresses the difficulty of preserving subject identity during text-guided image editing with diffusion models. It introduces HiPer, a method that decomposes the CLIP embedding into a highly personalized tail and content-related components, optimizing only the tail with a single source image and target text without model fine-tuning. By combining the optimized HiPer tail with the target content embedding, the approach achieves high-precision edits across motion, background, and texture while maintaining identity, demonstrated against several baselines with efficient training. The study further analyzes cross-attention behavior to justify the separation of personalization and manipulation and highlights practical considerations and limitations, offering a fast, practical pathway for personalized image manipulation using diffusion models.

Abstract

Diffusion models have shown superior performance in image generation and manipulation, but the inherent stochasticity presents challenges in preserving and manipulating image content and identity. While previous approaches like DreamBooth and Textual Inversion have proposed model or latent representation personalization to maintain the content, their reliance on multiple reference images and complex training limits their practicality. In this paper, we present a simple yet highly effective approach to personalization using highly personalized (HiPer) text embedding by decomposing the CLIP embedding space for personalization and content manipulation. Our method does not require model fine-tuning or identifiers, yet still enables manipulation of background, texture, and motion with just a single image and target text. Through experiments on diverse target texts, we demonstrate that our approach produces highly personalized and complex semantic image edits across a wide range of tasks. We believe that the novel understanding of the text embedding space presented in this work has the potential to inspire further research across various tasks.

Paper Structure

This paper contains 32 sections, 9 equations, 14 figures, and 2 tables.

Figures (14)

  • Figure 1: Image manipulation results with highly personalized (HiPer) text embeddings. In the upper row, the identities of the rabbit and the dog are well preserved while the images are adequately manipulated to align with the target texts. In the bottom row, not only the motion and background but also the texture of the source image is transformed toward the corresponding target text.
  • Figure 2: The proposed method. (Training) First, the source text prompt, which describes the source image, is converted to a text embedding. The parts of the text embedding that carry no information are removed. The remaining informative part of the embedding and the personalized embedding are concatenated and fed to the pre-trained U-Net. During training, only the personalized embedding is optimized. Although this figure depicts the learning in image space, the embedding is actually optimized in latent space. (Inference) The target embedding is likewise cropped and concatenated with the personalized embedding. The personalized embedding vector is calibrated by multiplying it by $\alpha=0.8$. The pre-trained text-to-image model, conditioned on that embedding, generates an image that has the meaning of the target text and the subject of the source image.
  • Figure 3: Cross-attention maps at the final timestep of text-to-image diffusion models. The source text is "a standing dog" and the target text is "a sitting dog". Cross-attention maps (a) conditioned on $\bm{e}_{src}$, (b) conditioned on $[\bm{e}_{src}', \bm{e}_{hper}]$, and (c) conditioned on $[\bm{e}_{tgt}', \bm{e}_{hper}]$. (d) Cross-attention maps from Imagic (Kawar et al., 2022) with Stable Diffusion.
  • Figure 4: Qualitative comparison results. Compared with three Stable Diffusion-based text-guided image manipulation methods, our method performs better: it preserves the identity of the subject in the source images while appropriately transforming the semantic information to align with the CLIP embedding of the target text. The original Imagic with Imagen achieves comparable results by using a proprietary text embedding scheme.
  • Figure 5: By concatenating highly personalized (HiPer) text embeddings with different target embeddings, we can achieve precise image manipulation results. This allows us to manipulate the image with high precision while preserving the subject's identity in the source image.
  • ...and 9 more figures
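
The inference-time embedding assembly described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `compose_embedding`, the plain-list representation of embeddings, and the toy sizes are all hypothetical, and the sketch assumes that the cropped (uninformative) part is the trailing tokens and that the total sequence length stays fixed. Only the scale $\alpha=0.8$ is taken from the caption.

```python
from typing import List

Embedding = List[List[float]]  # one row of floats per token


def compose_embedding(target_emb: Embedding,
                      hiper_emb: Embedding,
                      alpha: float = 0.8) -> Embedding:
    """Crop the target embedding and append the scaled personalized tail.

    Assumption: the personalized tokens replace the last len(hiper_emb)
    token slots of the target embedding, so the length is unchanged.
    """
    n_tokens = len(target_emb)
    n_hiper = len(hiper_emb)
    if not 0 < n_hiper < n_tokens:
        raise ValueError("hiper_emb must be non-empty and shorter than target_emb")
    dim = len(target_emb[0])
    if any(len(row) != dim for row in hiper_emb):
        raise ValueError("embedding dimensions differ")

    # Keep the leading (informative) target tokens, drop the trailing ones.
    head = [list(row) for row in target_emb[: n_tokens - n_hiper]]
    # Scale the optimized personalized embedding by alpha at inference.
    tail = [[alpha * x for x in row] for row in hiper_emb]
    return head + tail


# Toy example: 6 tokens of dimension 4, with 2 personalized tokens.
target = [[float(i)] * 4 for i in range(6)]
hiper = [[10.0] * 4, [20.0] * 4]
combined = compose_embedding(target, hiper)
```

In a real pipeline the rows would be text-encoder outputs rather than toy lists, and `combined` would be passed to the pre-trained U-Net as the conditioning. During training, the same concatenation would be used without the `alpha` scaling, with gradients flowing only into the personalized rows.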