CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image Personalization

Ruoyu Zhao, Mingrui Zhu, Shiyin Dong, Nannan Wang, Xinbo Gao

TL;DR

CatVersion tackles personalized text-to-image generation with diffusion models by learning a concept-specific residual in the feature-dense space of the CLIP text encoder. It concatenates residual embeddings to the Keys and Values in the last self-attention layers to model the gap between a base class and a target concept, which preserves prior knowledge while enabling faithful reconstruction and editing. A mask-based CLIP image alignment score gives a more accurate and less biased measure of personalization than the standard global score. Ablations indicate that both the feature-dense inversion space and the residual embeddings are necessary, and experiments show better results than existing word-embedding and fine-tuning approaches. The approach offers a practical, plug-and-play route to robust text-to-image personalization and suggests broader potential for inversion-based generation techniques.
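The idea behind a mask-based alignment score can be illustrated with a small sketch. The source does not specify how the masks are applied, so this assumes the simplest reading: pixels outside the concept mask are zeroed before embedding, and the score is the cosine similarity of the two embeddings. `toy_embed` is a hypothetical stand-in that only flattens pixels; a real evaluation would use the CLIP image encoder in its place. All function names here are illustrative.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def toy_embed(image):
    """Hypothetical stand-in for an image encoder: flattens the pixels.
    A real evaluation would call the CLIP image encoder here."""
    return image.reshape(-1).astype(np.float64)

def masked_alignment(img_a, mask_a, img_b, mask_b, embed=toy_embed):
    """Cosine similarity of embeddings after zeroing pixels outside each mask
    (an assumed way of applying the masks, not taken from the paper)."""
    return cosine(embed(img_a * mask_a), embed(img_b * mask_b))

rng = np.random.default_rng(0)
subject = rng.uniform(0.5, 1.0, size=(4, 4))   # the "concept" patch
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0                           # concept region
full_mask = np.ones((8, 8))                    # no masking

def compose(background):
    """Paste the same subject onto a given background."""
    img = background.copy()
    img[2:6, 2:6] = subject
    return img

ref = compose(rng.uniform(size=(8, 8)))   # reference image
gen = compose(rng.uniform(size=(8, 8)))   # same subject, new background

global_score = masked_alignment(ref, full_mask, gen, full_mask)
masked_score = masked_alignment(ref, mask, gen, mask)
```

In this toy setup the subject is identical in both images, so the masked score is not penalized by the changed background, whereas the unmasked score is.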

Abstract

We propose CatVersion, an inversion-based method that learns a personalized concept from a handful of examples. Users can then write text prompts to generate images that embody the personalized concept, thereby achieving text-to-image personalization. Existing approaches emphasize word embedding learning or parameter fine-tuning for the diffusion model, which can cause concept dilution or overfitting. In contrast, our method concatenates embeddings on the feature-dense space of the text encoder in the diffusion model to learn the gap between the personalized concept and its base class, aiming to preserve as much of the diffusion model's prior knowledge as possible while restoring the personalized concept. To this end, we first dissect the text encoder's integration in the image generation process to identify the feature-dense space of the encoder. We then concatenate embeddings on the Keys and Values in this space to learn the gap between the personalized concept and its base class. In this way, the concatenated embeddings ultimately manifest as a residual on the original attention output. To quantify the results of personalized image generation more accurately and with less bias, we improve the CLIP image alignment score using masks. Qualitatively and quantitatively, CatVersion restores personalized concepts more faithfully and enables more robust editing.
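The claim that concatenated embeddings act as a residual on the attention output can be checked with a short derivation. The notation below is an assumption made for illustration (single head, a single query row q, learnable rows ΔK and ΔV appended to K and V, feature dimension d) and may differ from the paper's own equations.

```latex
% Single query row q; [K; \Delta K] denotes row-wise concatenation.
\[
\begin{aligned}
Z &= \sum_{i} \exp\!\left(q k_i^{\top} / \sqrt{d}\right), \qquad
Z_{\Delta} = \sum_{j} \exp\!\left(q\, \Delta k_j^{\top} / \sqrt{d}\right), \qquad
\lambda = \frac{Z}{Z + Z_{\Delta}}, \\
\operatorname{Attn}\!\left(q, [K; \Delta K], [V; \Delta V]\right)
  &= \lambda \operatorname{Attn}(q, K, V)
   + (1 - \lambda) \operatorname{Attn}(q, \Delta K, \Delta V) \\
  &= \operatorname{Attn}(q, K, V)
   + (1 - \lambda)\left[\operatorname{Attn}(q, \Delta K, \Delta V)
   - \operatorname{Attn}(q, K, V)\right].
\end{aligned}
\]
```

Under these assumptions the second term is the residual: it vanishes when the appended Keys receive negligible attention mass (λ close to 1), which leaves the original attention output unchanged.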

Paper Structure (17 sections, 6 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: CatVersion allows users to learn the personalized concept through a handful of examples and then utilize text prompts to generate images that embody the personalized concept. In contrast to existing approaches, CatVersion concatenates embeddings on the feature-dense space of the text encoder in the diffusion model to learn the gap between the personalized concept and its base class, aiming to maximize the preservation of prior knowledge in diffusion models while restoring the personalized concepts.
  • Figure 2: CatVersion versus Textual Inversion. We compare the results of Textual Inversion (Gal et al., 2022) with those of CatVersion. As shown in (a), Textual Inversion fails to capture the personalized concept, while CatVersion accurately restores it. Panels (b) and (c) contrast the inversion spaces of the two methods, underscoring the advantages of inversion in the feature-dense space.
  • Figure 3: Overall Pipeline of CatVersion. First, we identify the feature-dense layers in the CLIP text encoder. Then, we concatenate the residual embeddings with the Keys and Values. During optimization, we use the base class word (e.g., dog) of the personalized concept as the text input and optimize these residual embeddings using a handful of images depicting one personalized concept. During inference, the residual embeddings can be removed or swapped to meet different personalization needs.
  • Figure 4: Visualizing Inversion across Multiple Layers. We concatenate embeddings and optimize them in each of the two self-attention layers in the CLIP text encoder. Then, we combine these embeddings with free text to create new scenarios for personalized concepts. The results indicate that self-attention layers at different depths focus on integrating different information, and that this focus shifts from the concrete to the abstract.
  • Figure 5: Qualitative Comparisons with Existing Methods. CatVersion restores personalized concepts more faithfully and offers stronger editing capabilities when combining various concepts with free text.
  • ...and 2 more figures
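
The Key/Value concatenation described in the Figure 3 caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it uses a single attention head, omits the causal mask and projection matrices of the real CLIP text encoder, and the names `delta_k` and `delta_v` for the learnable residual embeddings are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def attention_with_residual(q, k, v, delta_k=None, delta_v=None):
    """Attention with optional extra rows appended to the Keys and Values.

    delta_k / delta_v are hypothetical names for the learnable residual
    embeddings; passing None recovers the unmodified layer.
    """
    if delta_k is not None and delta_v is not None:
        k = np.concatenate([k, delta_k], axis=0)
        v = np.concatenate([v, delta_v], axis=0)
    return attention(q, k, v)

rng = np.random.default_rng(0)
n_tokens, n_extra, d = 5, 2, 8            # illustrative sizes only
q = rng.normal(size=(n_tokens, d))
k = rng.normal(size=(n_tokens, d))
v = rng.normal(size=(n_tokens, d))
delta_k = rng.normal(size=(n_extra, d))   # would be optimized in practice
delta_v = rng.normal(size=(n_extra, d))

base = attention_with_residual(q, k, v)                        # embeddings removed
personalized = attention_with_residual(q, k, v, delta_k, delta_v)
residual = personalized - base                                 # effect of the embeddings
```

Because the extra rows only extend the Keys and Values, the output keeps its original shape, and dropping or swapping `delta_k` and `delta_v` changes nothing else in the layer. That is one way to read the caption's remark that the residual embeddings can be removed or replaced at inference time.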