CycleDiff: Cycle Diffusion Models for Unpaired Image-to-image Translation

Shilong Zou, Yuhang Huang, Renjiao Yi, Chenyang Zhu, Kai Xu

TL;DR

CycleDiff addresses unpaired image-to-image translation between domains $\mathcal{S}$ and $\mathcal{T}$ by embedding a cycle-consistent translator inside diffusion models and optimizing diffusion and translation jointly. It extracts clean image components $C^{\mathcal{S}}_{t}$ and $C^{\mathcal{T}}_{t}$ from domain-specific diffusion models and applies a time-dependent translator at each denoising step $t$, enabling multi-step, structure-preserving cross-domain translation with networks $G_\phi$ and $F_\psi$. Empirical results on RGB$\leftrightarrow$RGB and cross-modality tasks show state-of-the-art FID/KID and competitive SSIM, with ablations validating the contributions of joint learning, image components, and the time-aware translator. The method provides a scalable framework for unpaired domain translation and can extend to sim-to-real and broader cross-modality applications.

Abstract

We introduce a diffusion-based cross-domain image translator that requires no paired training data. Unlike GAN-based methods, our approach uses diffusion models to learn the image translation process, which allows broader coverage of the data distribution and improves cross-domain translation performance. However, incorporating the translation process into the diffusion process remains challenging because the two processes are not exactly aligned: the diffusion process operates on the noisy signal, while the translation process operates on the clean signal. As a result, recent diffusion-based studies learn the two processes through separate training or shallow integration, which may trap the translation optimization in local minima and limit the effectiveness of diffusion models. To address this problem, we propose a novel joint learning framework that aligns the diffusion and translation processes, thereby improving global optimality. Specifically, we extract image components with diffusion models to represent the clean signal and apply the translation process to these image components, enabling end-to-end joint learning. In addition, we introduce a time-dependent translation network to learn the complex translation mapping, resulting in effective translation learning and a significant performance improvement. Thanks to the joint learning design, our method enables global optimization of both processes, improving optimality and achieving better fidelity and structural consistency. We have conducted extensive experiments on RGB$\leftrightarrow$RGB and diverse cross-modality translation tasks, including RGB$\leftrightarrow$Edge, RGB$\leftrightarrow$Semantics and RGB$\leftrightarrow$Depth, demonstrating better generative performance than the state of the art.
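
The cycle-consistency idea behind the time-dependent translators can be illustrated with a toy sketch. Everything here is an assumption made for illustration rather than the paper's implementation: the translators are random linear maps with a time-scaled bias (hypothetical stand-ins for $G_\phi$ and $F_\psi$), the image components are random vectors, and the L1 penalty is borrowed from the common CycleGAN convention.

```python
import numpy as np

DIM = 8  # toy size of a flattened image component (assumption)

def make_translator(seed):
    """Hypothetical time-dependent translator: linear map plus time-scaled bias."""
    r = np.random.default_rng(seed)
    W = r.normal(scale=0.3, size=(DIM, DIM))
    b = r.normal(scale=0.1, size=DIM)

    def translate(c, t):
        # t in [0, 1] scales the bias, standing in for a learned time embedding
        return c @ W.T + t * b

    return translate

def cycle_loss(G, F, c_s, c_t, t):
    """L1 cycle-consistency at step t: F(G(c_s)) ~ c_s and G(F(c_t)) ~ c_t."""
    forward = np.abs(F(G(c_s, t), t) - c_s).mean()   # S -> T -> S
    backward = np.abs(G(F(c_t, t), t) - c_t).mean()  # T -> S -> T
    return forward + backward

G = make_translator(1)  # stands in for G_phi: S -> T
F = make_translator(2)  # stands in for F_psi: T -> S

rng = np.random.default_rng(0)
c_s = rng.normal(size=(4, DIM))  # toy components from domain S
c_t = rng.normal(size=(4, DIM))  # toy components from domain T

loss = cycle_loss(G, F, c_s, c_t, t=0.5)
identity = lambda c, t: c
zero_loss = cycle_loss(identity, identity, c_s, c_t, t=0.5)
print(f"random translators: {loss:.4f}, identity translators: {zero_loss:.4f}")
```

The loss vanishes only when the two translators invert each other at the given step, which is the property a cycle-consistency term is meant to encourage. In a joint training setup this term would be minimized together with the diffusion objective.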

Paper Structure

This paper contains 19 sections, 13 equations, 9 figures, 6 tables, 2 algorithms.

Figures (9)

  • Figure 1: The proposed CycleDiff consists of two domain-specific diffusion models and a cycle translator, and learns the diffusion and translation processes jointly. The cycle translator consists of two translation networks that perform cycle translation between the two domains: $G_\phi:\mathcal{S}\to\mathcal{T}$ and $F_\psi: \mathcal{T}\to\mathcal{S}$. We employ a cycle-consistency constraint to regularize the forward and backward translation mappings. Using only unpaired images, CycleDiff can synthesize structure-consistent and photo-realistic results across different image modalities.
  • Figure 2: The overall architecture of CycleDiff. CycleDiff comprises two parts: the diffusion models and the cycle translator. The diffusion models are employed to extract image components, which are then fed into the cycle translator for unpaired translation between two domains. The diffusion and translation processes are learned jointly.
  • Figure 3: Comparison of the decoupled diffusion model and the traditional diffusion model [ho2020ddpm] (with 'x0 prediction'). The decoupled diffusion model can isolate clean components from the noisy input, while the estimates of the traditional diffusion model are still noisy and not suitable for the subsequent translation process.
  • Figure 4: Qualitative comparisons on RGB $\leftrightarrow$ RGB tasks with state-of-the-art methods. CycleDiff achieves superior visual results in both realism and faithfulness across all tasks. For example, in the fourth row, our method retains domain-independent features, such as the white ground, while removing domain-specific ones, such as the shape of the eyebrows and mouth.
  • Figure 5: More visual results of CycleDiff on additional datasets. Our method produces high-fidelity results on both time-varying datasets and challenging artificial-domain data.
  • ...and 4 more figures
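
The 'x0 prediction' mentioned in the Figure 3 caption refers to the standard DDPM identity $\hat{x}_0 = (x_t - \sqrt{1-\bar\alpha_t}\,\hat\epsilon)/\sqrt{\bar\alpha_t}$. The sketch below only illustrates that identity on random data and is not the paper's decoupled diffusion model. The value of `alpha_bar` and the size of the noise-estimate error are arbitrary assumptions. It shows one plausible reason such estimates can stay noisy: any error in the predicted noise is scaled by $\sqrt{(1-\bar\alpha_t)/\bar\alpha_t}$, which is large at noisy timesteps.

```python
import numpy as np

def q_sample(x0, eps, alpha_bar):
    """DDPM forward process: x_t = sqrt(abar) * x0 + sqrt(1 - abar) * eps."""
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

def predict_x0(x_t, eps_hat, alpha_bar):
    """Invert the forward process using an estimate of the noise."""
    return (x_t - np.sqrt(1.0 - alpha_bar) * eps_hat) / np.sqrt(alpha_bar)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=16)   # toy "clean image"
eps = rng.normal(size=x0.shape)        # true noise
alpha_bar = 0.05                       # a late, heavily noised step (assumption)
x_t = q_sample(x0, eps, alpha_bar)

# With the exact noise, the clean signal is recovered.
exact = predict_x0(x_t, eps, alpha_bar)

# With a slightly wrong noise estimate, the error in x0 is amplified.
eps_hat = eps + 0.1 * rng.normal(size=x0.shape)
noisy = predict_x0(x_t, eps_hat, alpha_bar)
gain = np.sqrt((1.0 - alpha_bar) / alpha_bar)

print("max error with exact noise:", np.abs(exact - x0).max())
print("max error with noisy estimate:", np.abs(noisy - x0).max())
print("error amplification factor:", gain)
```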