Harmonizing Visual and Textual Embeddings for Zero-Shot Text-to-Image Customization

Yeji Song, Jimyeong Kim, Wonhark Park, Wonsik Shin, Wonjong Rhee, Nojun Kwak

TL;DR

This work proposes a text-orthogonal visual embedding that harmonizes with the given textual embedding, and injects the subject's clear features using a self-attention swap, offering highly flexible zero-shot generation while maintaining the subject's identity.

Abstract

Amid a surge of text-to-image (T2I) models and customization methods that generate new images of a user-provided subject, current works focus on reducing the cost of lengthy per-subject optimization. These zero-shot customization methods encode the image of a specified subject into a visual embedding, which is then used alongside the textual embedding for diffusion guidance. The visual embedding carries intrinsic information about the subject, while the textual embedding provides a new, transient context. However, existing methods often 1) are significantly affected by the input images, e.g., generating images with the same pose, and 2) exhibit deterioration in the subject's identity. We first pin down the problem and show that redundant pose information in the visual embedding interferes with the textual embedding, which contains the desired pose information. To address this issue, we propose an orthogonal visual embedding that effectively harmonizes with the given textual embedding. We also adopt the visual-only embedding and inject the subject's clear features using a self-attention swap. Our results demonstrate the effectiveness and robustness of our method, which offers highly flexible zero-shot generation while effectively maintaining the subject's identity.
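
To make the idea of an orthogonal visual embedding concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that "orthogonal" means removing from each visual token the component that lies in the span of the textual tokens; the function name `orthogonalize_visual_embedding`, the array shapes, and the `rel_tol` parameter are all hypothetical and chosen only for illustration.

```python
import numpy as np


def orthogonalize_visual_embedding(visual, textual, rel_tol=1e-10):
    """Project visual tokens onto the orthogonal complement of the textual tokens.

    visual:  (n_visual, d) array of visual embedding tokens (illustrative shape)
    textual: (n_text, d) array of textual embedding tokens (illustrative shape)
    """
    visual = np.asarray(visual, dtype=np.float64)
    textual = np.asarray(textual, dtype=np.float64)

    # The rows of vt are orthonormal directions in embedding space. Keeping only
    # those with non-negligible singular values gives a basis for the row space
    # of the textual embedding, even when the textual tokens are rank deficient.
    _, s, vt = np.linalg.svd(textual, full_matrices=False)
    basis = vt[s > rel_tol * s.max()]

    # Subtract the part of every visual token that lies in the textual subspace.
    projection = visual @ basis.T @ basis
    return visual - projection
```

Under this assumption, every row of the returned array has a zero inner product with every textual token (up to floating-point error), so the adjusted visual embedding no longer carries any component along the directions the text prompt uses.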

Paper Structure (17 sections, 5 equations, 8 figures, 1 table)

Figures (8)

  • Figure 1: Our method addresses the challenges of pose variation in zero-shot customization methods, i.e., (i) strong adherence to the pose in the input image and (ii) loss of the subject’s identity. Our method paves the way for more diverse, efficient, and lively subject-driven generation.
  • Figure 2: Visualization of discord among contextual embeddings. (a) The contextual embedding of zero-shot customization methods is generally composed of a visual and a textual embedding, which provide the subject's identity and a novel context, respectively. (b) However, pose information in the visual embedding interferes with the textual embedding, resulting in biased images that stay close to the input image. Meanwhile, our orchestration resolves the conflict, generating images that both preserve identity and are faithful to the text prompt.
  • Figure 3: The overall pipeline of our method. (a) To alleviate the pose bias caused by the visual embedding, we conduct Orchestration of the contextual embedding, i.e., we adjust the visual embedding to be orthogonal to the textual embedding. Orchestration guides the denoising process toward the pose directed by the text prompt. (b) Self-attention Swap obtains the self-attention keys and values from another denoising process guided by the visual-only embedding, which offers the subject's clean identity. After swapping, our method retains the subject's identity while changing its pose to faithfully follow the text prompt.
  • Figure 4: Comparisons with BLIP-Diffusion.
  • Figure 5: Comparisons with ELITE.
  • ...and 3 more figures
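
The self-attention swap described in the Figure 3(b) caption can be illustrated with a toy single-head attention in NumPy. This is a sketch under stated assumptions rather than the paper's code: queries come from the main (text-guided) denoising branch, while keys and values are taken from a reference branch guided by the visual-only embedding. The function names and array shapes are hypothetical, and the sketch ignores which layers or timesteps the swap applies to, since the page does not specify them.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Standard scaled dot-product attention for a single head.
    scale = 1.0 / np.sqrt(q.shape[-1])
    weights = softmax((q @ k.T) * scale, axis=-1)
    return weights @ v


def swapped_self_attention(q_main, k_ref, v_ref):
    """Attend from the main branch's queries to the reference branch's keys/values.

    q_main: (n_main, d) queries from the text-guided branch (illustrative)
    k_ref:  (n_ref, d) keys from the visual-only branch (illustrative)
    v_ref:  (n_ref, d_v) values from the visual-only branch (illustrative)
    """
    return attention(q_main, k_ref, v_ref)
```

Because each output row is a convex combination of the reference values, the main branch keeps its own spatial layout (through its queries) while drawing appearance features from the reference branch, which matches the intent stated in the caption.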