
Selectively Informative Description can Reduce Undesired Embedding Entanglements in Text-to-Image Personalization

Jimyeong Kim, Jungwon Park, Wonjong Rhee

TL;DR

The paper tackles the problem of undesired embedding entanglement in text-to-image personalization, where reference biases leak into generated images and misalign with prompts. It introduces Selectively Informative Description (SID), a training-description augmentation generated by multimodal GPT-4 that adds informative details about undesired objects while preserving subject identity, and integrates SID into optimization-based personalization methods. Through cross-attention analyses, tailored evaluation metrics (subject-alignment, non-subject-disentanglement, and text-alignment), and extensive experiments across multiple personalization models and datasets, SID consistently reduces entanglement and improves alignment with prompts. The findings demonstrate SID's effectiveness in mitigating biases such as background, nearby-object, tied-object, substance, and pose, with implications for more faithful and controllable personalized generation in multi-modal settings.

Abstract

In text-to-image personalization, a timely and crucial challenge is the tendency of generated images to overfit to the biases present in the reference images. We initiate our study with a comprehensive categorization of the biases into background, nearby-object, tied-object, substance (in style re-contextualization), and pose biases. These biases manifest in the generated images due to their entanglement into the subject embedding. This undesired embedding entanglement not only results in the reflection of biases from the reference images into the generated images but also notably diminishes the alignment of the generated images with the given generation prompt. To address this challenge, we propose SID (Selectively Informative Description), a text description strategy that deviates from the prevalent approach of only characterizing the subject's class identification. SID is generated utilizing multimodal GPT-4 and can be seamlessly integrated into optimization-based models. We present comprehensive experimental results along with analyses of cross-attention maps, subject-alignment, non-subject-disentanglement, and text-alignment.
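
To make the contrast between a class-only description and a selectively informative one concrete, here is a minimal sketch of how a training description might be assembled once a caption is available. It is an illustration, not the authors' implementation: the helper name `insert_identifier`, the example captions, and the choice to place the identifier `[v]` directly before the class noun (the usual DreamBooth-style convention) are all assumptions. In the paper the captions come from an instruction-following VLM; here they are hard-coded stand-ins.

```python
import re


def insert_identifier(caption: str, class_noun: str, identifier: str = "[v]") -> str:
    """Insert `identifier` before the first whole-word occurrence of `class_noun`.

    Hypothetical helper: the placement rule is an assumption, not taken from the paper.
    """
    pattern = re.compile(rf"\b{re.escape(class_noun)}\b", flags=re.IGNORECASE)
    match = pattern.search(caption)
    if match is None:
        raise ValueError(f"class noun {class_noun!r} not found in caption")
    start = match.start()
    return f"{caption[:start]}{identifier} {caption[start:]}"


# Class-only description: names nothing but the subject's class.
class_only = insert_identifier("a photo of a cat", "cat")

# SID-style description (made-up caption): also names the non-subject
# content of the reference image, such as the background and a nearby object.
sid_style = insert_identifier(
    "a photo of a cat sitting on a wooden floor next to a potted plant", "cat"
)

print(class_only)
print(sid_style)
```

The intuition, as the abstract frames it, is that naming the non-subject content in the text gives that content somewhere to attach other than the subject embedding.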

Paper Structure (20 sections, 2 equations, 11 figures, 1 table)

This paper contains 20 sections, 2 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Five key biases -- background, nearby-object, tied-object, substance (in style re-contextualization), and pose biases. The first four rows depict scenarios with multiple reference images, while the last row illustrates a single reference image scenario. The pose bias is particularly prone to manifest in scenarios involving a single reference image, although it can also occur when multiple reference images depict the subject in similar poses. In the generation prompt, the subject of interest is highlighted in red. The integration of our method into DreamBooth [ruiz2023dreambooth] effectively resolves embedding entanglements associated with the five key biases (rightmost column).
  • Figure 2: Personalization with SID. We propose integrating SID (Selectively Informative Description) into the per-subject optimization, where an instruction-following VLM (Vision-Language Model) is utilized to generate a selectively informative description for each reference image.
  • Figure 3: Two examples for comparing the four cases of descriptions. For each case, the choice of train description follows the guidelines in \ref{tab:prompt_ablation}. The common generation prompt for each example is shown below the generated images. Additional examples can be found in \ref{sec: Supple_four description cases}.
  • Figure 4: Comparison of three instruction-following VLMs for generating SID. For the reference image of a cat and the instructions shown at the top, the three VLMs generate the image captions shown on the right side of the image. Subsequently, the unique identifier [v] is inserted and the resulting captions are used for conditioning the diffusion model as the train descriptions. For painting/cartoon style re-contextualization, we used a slightly different instruction, as detailed in \ref{sec: Supple_for style re-contextualization}. Fifteen additional examples of VLM-generated SIDs are available in \ref{sec: Supple_instruction-following VLMs}.
  • Figure 5: Enhancement by SID. For four optimization-based models (DreamBooth [ruiz2023dreambooth], Custom Diffusion [kumari2023multi], SVDiff [han2023svdiff], and Textual Inversion [gal2022image]), the baseline results are shown together with SID-integrated results. SID-integration effectively resolves entanglement issues in scenarios with high biases, represented by indoor background (1st row), nearby potted plant (2nd row), filled-in blueberries (3rd row), and sunflower substances (last row). Additional examples can be found in \ref{sec: Supple_enhancement by SID}.
  • ...and 6 more figures