Contrastive Learning Guided Latent Diffusion Model for Image-to-Image Translation

Qi Si, Bo Wang, Zhao Zhang

TL;DR

The paper addresses training-free, text-guided image-to-image translation using diffusion models, focusing on preserving content from the reference image while applying target edits. It introduces pix2pix-zeroCon, which automatically derives editing directions via BLIP/CLIP and employs cross-attention guiding loss and patch-wise CUT loss to preserve structure without additional training. Key contributions include a content-structure preservation framework, an automatic editing-direction strategy, and extensive experiments showing improved fidelity and controllability over existing zero-shot methods. The approach reduces user effort and computational cost while delivering robust editing across diverse tasks, with limitations mainly in target addition/removal scenarios that affect background and viewpoint consistency.

Abstract

The diffusion model has demonstrated superior performance in synthesizing diverse and high-quality images for text-guided image translation. However, there remains room for improvement in both the formulation of text prompts and the preservation of reference image content. First, variations in target text prompts can significantly influence the quality of the generated images, and it is often challenging for users to craft an optimal prompt that fully captures the content of the input image. Second, while existing models can introduce desired modifications to specific regions of the reference image, they frequently induce unintended alterations in areas that should remain unchanged. To address these challenges, we propose pix2pix-zeroCon, a zero-shot diffusion-based method that eliminates the need for additional training by leveraging patch-wise contrastive loss. Specifically, we automatically determine the editing direction in the text embedding space based on the reference image and target prompts. Furthermore, to ensure precise content and structural preservation in the edited image, we introduce cross-attention guiding loss and patch-wise contrastive loss between the generated and original image embeddings within a pre-trained diffusion model. Notably, our approach requires no additional training and operates directly on a pre-trained text-to-image diffusion model. Extensive experiments demonstrate that our method surpasses existing models in image-to-image translation, achieving enhanced fidelity and controllability.
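The patch-wise contrastive loss mentioned above can be illustrated with a minimal sketch. It assumes the loss takes the usual InfoNCE form used in CUT: a patch feature of the edited image should match the feature at the same location in the reference image and differ from features at other locations. The function name, the temperature value, and the random stand-in features are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def patch_nce_loss(query, key, temperature=0.07):
    """InfoNCE over patches: query[i] should match key[i], not key[j != i].

    query, key: arrays of shape (num_patches, dim), e.g. features of the
    edited and the reference image sampled at the same spatial locations.
    The temperature of 0.07 is an assumed, commonly used default.
    """
    # L2-normalise so that dot products are cosine similarities.
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    k = key / np.linalg.norm(key, axis=1, keepdims=True)
    logits = q @ k.T / temperature  # (N, N); the diagonal holds positive pairs
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the matching location as the correct class.
    return float(-np.mean(np.diag(log_prob)))

# Toy features standing in for diffusion-model activations (hypothetical data).
rng = np.random.default_rng(0)
ref = rng.normal(size=(16, 32))
aligned = ref + 0.01 * rng.normal(size=ref.shape)  # structure preserved
shifted = np.roll(ref, 1, axis=0)                  # every patch misplaced

loss_aligned = patch_nce_loss(aligned, ref)
loss_shifted = patch_nce_loss(shifted, ref)
print(loss_aligned, loss_shifted)
```

In this toy setting the loss is low when each patch of the edited image stays close to its counterpart in the reference and high when patches are displaced, which is the behaviour that lets such a term penalise unintended structural changes during denoising.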

Paper Structure

This paper contains 14 sections, 18 equations, 12 figures, 1 table, 1 algorithm.

Figures (12)

  • Figure 1: Text-guided image translation results produced by our method. The method preserves the structural elements of the source image while transforming its content according to the target text prompts.
  • Figure 2: Pipeline of the proposed method: To guide the diffusion model, we first obtain the text embedding and editing directions using the pre-trained BLIP and CLIP models, respectively. We then employ the cross-attention loss and CUT loss to steer the denoising process of the diffusion model.
  • Figure 3: Comparison of the similarity between the encoded text vectors and image vectors obtained before and after adding the editing direction $\Delta \boldsymbol{c}$, where the text and image correspond to the examples given in Fig. 2. "Text 2" is obtained by encoding the target prompt "A painting of a man with a mustache wearing black sunglasses". "Text 1" is obtained by encoding the source prompt "A painting of a man with a mustache" and adding the editing direction $\Delta \boldsymbol{c}$. "Image" is obtained by encoding the source image latent $\boldsymbol{x}_0$.
  • Figure 4: Comparison of qualitative experimental results of each method on real images.
  • Figure 5: Analysis of the average quantitative results corresponding to the qualitative results of each method in Fig. 4.
  • ...and 7 more figures
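
The comparison described in the Figure 3 caption can be sketched with toy vectors. The sketch assumes that the editing direction $\Delta \boldsymbol{c}$ is the difference between the mean embeddings of a few target-style and source-style sentences, as in the original pix2pix-zero; how pix2pix-zeroCon actually derives it from BLIP and CLIP is not specified in this summary. All vectors are random stand-ins for encoder outputs, and every name is hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim = 64

# Hypothetical sentence embeddings: several paraphrases of the source
# prompt, and the same paraphrases with the target attribute added.
base = rng.normal(size=dim)
shift = rng.normal(size=dim)  # stands in for the attribute being added
source_embs = base + 0.1 * rng.normal(size=(8, dim))
target_embs = base + shift + 0.1 * rng.normal(size=(8, dim))

# Assumed editing direction: difference of the mean embeddings.
delta_c = target_embs.mean(axis=0) - source_embs.mean(axis=0)

c_source = source_embs[0]     # encoded source prompt
text_2 = target_embs[0]       # encoded target prompt ("Text 2")
text_1 = c_source + delta_c   # source prompt plus direction ("Text 1")

sim_before = cosine(c_source, text_2)
sim_after = cosine(text_1, text_2)
print(sim_before, sim_after)
```

Under these assumptions, adding the direction moves the source embedding toward the target embedding, which is the kind of before/after similarity comparison the caption describes.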