MIDMs: Matching Interleaved Diffusion Models for Exemplar-based Image Translation

Junyoung Seo, Gyuseong Lee, Seokju Cho, Jiyoung Lee, Seungryong Kim

TL;DR

MIDMs introduce a diffusion-guided, interleaved framework for exemplar-based image translation, addressing the limitations of GAN-based matching-then-generation by refining cross-domain correspondences inside the diffusion process. The approach uses latent-space encoders to obtain domain-invariant features, soft-correlates and warps exemplars, and then iteratively refines the warped latent through diffusion while selectively rewarping confident regions via cycle-consistency. Losses across cross-domain correspondence, perceptual/style fidelity, and diffusion-prior refinement jointly guide the system to preserve content while transferring exemplar style, with strong empirical results on CelebA-HQ, DeepFashion, and LSUN-Churches and comprehensive ablations. The work demonstrates competitive or superior performance in quality, fidelity, and style relevance, while highlighting practical considerations like slower sampling and broader societal impacts.
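
The soft-correlation warp and the cycle-consistency check mentioned above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the temperature `tau`, the pixel threshold `thresh`, and the use of hard argmax matches for the cycle check are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Normalize each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def soft_warp(feat_x, feat_y, values_y, tau=0.01):
    """Warp exemplar values into the condition layout (hypothetical helper).

    feat_x:   (Nx, C) features of the condition image
    feat_y:   (Ny, C) features of the exemplar image
    values_y: (Ny, D) exemplar content to move (e.g. latent vectors)
    tau:      softmax temperature (assumed value)
    """
    corr = l2_normalize(feat_x) @ l2_normalize(feat_y).T   # (Nx, Ny) cosine similarity
    logits = corr / tau
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)          # each row sums to 1
    return attn @ values_y, corr

def cycle_confidence(corr, coords_x, thresh=1.0):
    """Mark a condition position as confident if x -> y -> x lands near its start.

    coords_x: (Nx, 2) spatial coordinates of the condition positions
    thresh:   maximum allowed round-trip displacement (assumed value)
    """
    fwd = corr.argmax(axis=1)          # best exemplar match for each condition position
    bwd = corr.argmax(axis=0)          # best condition match for each exemplar position
    round_trip = bwd[fwd]              # index reached after going x -> y -> x
    dist = np.linalg.norm(coords_x[round_trip] - coords_x, axis=1)
    return dist <= thresh
```

With a low temperature the softmax approaches a hard assignment, so the warp behaves like nearest-neighbour matching. A higher temperature blends several exemplar positions.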

Abstract

We present a novel method for exemplar-based image translation, called matching interleaved diffusion models (MIDMs). Most existing methods for this task are formulated as a GAN-based matching-then-generation framework. However, in this framework, matching errors induced by the difficulty of semantic matching across domains, e.g., sketch and photo, can easily propagate to the generation step, which in turn leads to degenerate results. Motivated by the recent success of diffusion models in overcoming the shortcomings of GANs, we incorporate diffusion models to address these limitations. Specifically, we formulate a diffusion-based matching-and-generation framework that interleaves cross-domain matching and diffusion steps in the latent space by iteratively feeding the intermediate warp into the noising process and denoising it to generate a translated image. In addition, to improve the reliability of the diffusion process, we design a confidence-aware process that uses cycle-consistency to consider only confident regions during translation. Experimental results show that our MIDMs generate more plausible images than state-of-the-art methods.
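
The interleaving described in the abstract can be sketched as a simple loop. This is a schematic under stated assumptions, not the paper's algorithm: `noise_fn`, `denoise_fn`, and `rewarp_fn` are hypothetical callables standing in for the forward noising step, the learned denoiser, and the confidence-aware re-matching step, and the number of iterations is an arbitrary choice.

```python
import numpy as np

def interleaved_translate(r_init, noise_fn, denoise_fn, rewarp_fn, n_iters=5):
    """Alternate diffusion refinement and re-matching on a warped latent.

    r_init:     (N, D) initial warped latent
    noise_fn:   (latent, step) -> noised latent            (hypothetical)
    denoise_fn: (noised latent, step) -> refined latent    (hypothetical)
    rewarp_fn:  (refined latent) -> (warped latent, (N,) boolean confidence mask)
    """
    r = r_init
    for n in range(n_iters, 0, -1):
        noisy = noise_fn(r, n)            # feed the intermediate warp into the noising process
        refined = denoise_fn(noisy, n)    # denoise to pull it toward the learned prior
        warped, conf = rewarp_fn(refined) # re-match against the exemplar
        # Keep the new warp only where the match is confident; elsewhere keep
        # the diffusion-refined latent.
        r = np.where(conf[:, None], warped, refined)
    return r
```

Passing the three steps in as callables keeps the sketch independent of any particular diffusion library or matching module.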

Paper Structure

This paper contains 48 sections, 17 equations, 11 figures, 9 tables, 1 algorithm.

Figures (11)

  • Figure 1: Motivation: (a) existing works [Liao2017VisualAT, zhang2020cross, zhan2021unbalanced, zhan2021bilevelFA, zhou2021cocosnet, zhan2022marginal] and (b) our MIDMs, which include an interleaved process of matching and generation that can refine the correspondence and the embedded feature simultaneously.
  • Figure 2: Overall architecture of MIDMs. For a condition image $I_\mathcal{X}$ and an exemplar image $I_\mathcal{Y}$, we first compute an initial matching and obtain the initial warped feature $\mathcal{R}_{\mathcal{X}\leftarrow{\mathcal{Y}}}$. We then iteratively perform diffusion and in-domain alignment with the warped feature $r^{n}_{\mathcal{Y}}$ and the reference $\mathcal{Y}$ to finally obtain $r^{0}_{\mathcal{Y}}$, which is used to produce $I_{\mathcal{X}\leftarrow \mathcal{Y}}$.
  • Figure 3: Examples of iterative matching-and-generation process: (a) exemplar image, (b) condition image, (c)-(e) intermediate results of iterative process, which are refined gradually ($n=5, 4, 2$), and (f) final synthesis result ($n=0$).
  • Figure 4: Qualitative results for edge-to-face on CelebA-HQ [liu2015deep]: (from top to bottom) exemplars, conditions, and results by CoCosNet [zhang2020cross] and our MIDMs.
  • Figure 5: Qualitative results for keypoints-to-photos on DeepFashion [liu2016deepfashion]: (from top to bottom) exemplars, conditions, and results by CoCosNet [zhang2020cross] and our MIDMs.
  • ...and 6 more figures