CoCoDiff: Correspondence-Consistent Diffusion Model for Fine-grained Style Transfer

Wenbo Nie, Zixiang Li, Renshuai Tao, Bin Wu, Yunchao Wei, Yao Zhao

TL;DR

CoCoDiff is a training-free, low-cost style transfer framework that leverages pretrained latent diffusion models to achieve fine-grained, semantically consistent stylization.

Abstract

Transferring visual style between images while preserving semantic correspondence between similar objects remains a central challenge in computer vision. Existing methods have made great strides, but most operate at a global level and overlook region-wise and pixel-wise semantic correspondence. To address this, we propose CoCoDiff, a novel training-free and low-cost style transfer framework that leverages pretrained latent diffusion models to achieve fine-grained, semantically consistent stylization. We identify that correspondence cues within generative diffusion models are under-explored and that content consistency across semantically matched regions is often neglected. CoCoDiff introduces a pixel-wise semantic correspondence module that mines intermediate diffusion features to construct a dense alignment map between content and style images. A cycle-consistency module then enforces structural and perceptual alignment across iterations, yielding object-level and region-level stylization that preserves geometry and detail. Despite requiring no additional training or supervision, CoCoDiff delivers state-of-the-art visual quality and strong quantitative results, outperforming methods that rely on extra training or annotations.
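
To make the idea of a dense alignment map concrete, the following is a minimal sketch of pixel-wise matching between two feature maps. It is an illustration under stated assumptions, not the CoCoDiff implementation: the abstract does not specify which diffusion features are used or how matches are selected, so this sketch assumes the features are already extracted as `(H, W, C)` arrays and uses plain cosine-similarity nearest-neighbor matching. The function name `dense_correspondence` and its signature are hypothetical.

```python
import numpy as np

def dense_correspondence(content_feats, style_feats):
    """Match every content location to its most similar style location.

    content_feats: array of shape (H, W, C), assumed pre-extracted features.
    style_feats:   array of shape (Hs, Ws, C), same channel count.
    Returns:
      match: (H, W, 2) integer array of (row, col) indices into the style map.
      score: (H, W) cosine similarity of each chosen match.
    """
    h, w, c = content_feats.shape
    hs, ws, _ = style_feats.shape

    # Flatten spatial dimensions so each row is one location's feature vector.
    q = content_feats.reshape(-1, c)
    k = style_feats.reshape(-1, c)

    # L2-normalize so the dot product equals cosine similarity.
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-8)
    k = k / (np.linalg.norm(k, axis=1, keepdims=True) + 1e-8)

    # Similarity between every content location and every style location.
    sim = q @ k.T  # shape (H*W, Hs*Ws)

    # Nearest neighbor in the style map for each content location.
    best = sim.argmax(axis=1)
    rows, cols = np.unravel_index(best, (hs, ws))

    match = np.stack([rows, cols], axis=-1).reshape(h, w, 2)
    score = sim.max(axis=1).reshape(h, w)
    return match, score
```

A map like `match` could then be used to look up, for each content pixel, the style features at its matched location. The full similarity matrix grows with the product of the two spatial sizes, so this form is only practical at low feature resolutions.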

Paper Structure (36 sections, 16 equations, 13 figures, 6 tables, 1 algorithm)

Figures (13)

  • Figure 1: Comparison of different style transfer methods. We compare our proposed CoCoDiff with three other representative methods with zoomed-in details.
  • Figure 2: Framework Overview of our proposed Correspondence-Consistent Diffusion (CoCoDiff).
  • Figure 3: Qualitative comparison. We compare CoCoDiff (Ours) with seven representative methods, selected from diffusion-based, patch-based, CNN-based, transformer-based, and other approaches, to provide a comprehensive evaluation.
  • Figure 4: (A) Qualitative comparison with additional zoomed-in details. We compare our method with StyleID and StyTR$^{2}$ as baseline approaches, highlighting the differences through zoomed-in details. (B) The illustration of cycle-based image style transfer. (a) Direct feature matching between the style image and the content image often results in low matching accuracy and feature correspondence failure. (b) By first transforming the style image to adopt the content image’s style before performing feature matching, the matching accuracy is significantly improved. (c) Direct correspondence result. (d) Indirect correspondence result.
  • Figure 5: Quantitative comparison of the cycle module. (a) Balance between LPIPS and FID metrics across iterations. (b) CFSD variations across iterations.
  • ...and 8 more figures