Contrastive Learning Guided Latent Diffusion Model for Image-to-Image Translation

Qi Si; Bo Wang; Zhao Zhang

TL;DR

The paper addresses training-free, text-guided image-to-image translation with diffusion models, focusing on preserving the content of the reference image while applying the target edits. It introduces pix2pix-zeroCon, which automatically derives editing directions via BLIP/CLIP and uses a cross-attention guiding loss and a patch-wise contrastive (CUT) loss to preserve structure without additional training. Key contributions include a content-structure preservation framework, an automatic editing-direction strategy, and extensive experiments showing improved fidelity and controllability over existing zero-shot methods. The approach reduces user effort and computational cost while delivering robust editing across diverse tasks; its main limitations arise in target addition/removal scenarios, where background and viewpoint consistency can suffer.

Abstract

Diffusion models have demonstrated superior performance in synthesizing diverse and high-quality images for text-guided image translation. However, there remains room for improvement in both the formulation of text prompts and the preservation of reference image content. First, variations in target text prompts can significantly influence the quality of the generated images, and it is often challenging for users to craft an optimal prompt that fully captures the content of the input image. Second, while existing models can introduce desired modifications to specific regions of the reference image, they frequently induce unintended alterations in areas that should remain unchanged. To address these challenges, we propose pix2pix-zeroCon, a zero-shot diffusion-based method that leverages a patch-wise contrastive loss and requires no additional training. Specifically, we automatically determine the editing direction in the text embedding space based on the reference image and target prompts. Furthermore, to ensure precise content and structural preservation in the edited image, we introduce a cross-attention guiding loss and a patch-wise contrastive loss between the generated and original image embeddings within a pre-trained diffusion model. Our approach operates directly on a pre-trained text-to-image diffusion model. Extensive experiments demonstrate that our method surpasses existing models in image-to-image translation, achieving enhanced fidelity and controllability.
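
To make the two ingredients named in the abstract more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the editing direction is the difference between the mean target and mean source text embeddings (a common convention, not confirmed by this page), and that the patch-wise contrastive loss is an InfoNCE loss in which the positive for each generated patch is the reference patch at the same location. The function names `edit_direction` and `patch_nce_loss` and the temperature value are hypothetical.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def edit_direction(source_embs, target_embs):
    """Assumed form: mean(target embeddings) - mean(source embeddings)."""
    mean_src = _mean(source_embs)
    mean_tgt = _mean(target_embs)
    return [t - s for s, t in zip(mean_src, mean_tgt)]


def patch_nce_loss(gen_feats, ref_feats, temperature=0.07):
    """Mean InfoNCE loss over patch locations.

    gen_feats[i] and ref_feats[i] are feature vectors for patch location i
    in the generated and reference image. The positive key for query i is
    ref_feats[i]; all other reference patches act as negatives.
    """
    gen = [_normalize(v) for v in gen_feats]
    ref = [_normalize(v) for v in ref_feats]
    total = 0.0
    for i, query in enumerate(gen):
        logits = [_dot(query, key) / temperature for key in ref]
        # Numerically stable log-sum-exp over all keys.
        top = max(logits)
        log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
        # Cross-entropy with the same-location patch as the correct class.
        total += log_sum - logits[i]
    return total / len(gen)
```

In a guidance setting, a loss of this kind would be evaluated on intermediate features of the pre-trained diffusion model at each denoising step, and its gradient used to steer the latent. The loss is small when each generated patch is most similar to the reference patch at the same location, which is what encourages structure to be preserved.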
