A Diffusion Model Translator for Efficient Image-to-Image Translation

Mengfei Xia, Yu Zhou, Ran Yi, Yong-Jin Liu, Wenping Wang

TL;DR

This paper addresses the inefficiency of applying diffusion models to image-to-image translation by proposing a Diffusion Model Translator (DMT) that attaches a lightweight translator at a single preset diffusion timestep, avoiding injection at every denoising step. It provides a theoretical justification showing that transferring the distribution between source and target domains at an intermediate step is feasible, and it introduces a practical method to automatically select the translation timestep. The translator is trained via a variational lower bound, resulting in a Gaussian mapping with a tractable objective, and a reparameterization ties the translator to the shared forward diffusion of both domains. Empirically, DMT achieves competitive or superior image quality across stylization, colorization, segmentation-to-image, and sketch-to-image tasks while delivering substantial speedups (via early translation and DDIM acceleration), validating its practical impact for fast, high-quality conditional image synthesis.

Abstract

Applying diffusion models to image-to-image translation (I2I) has recently received increasing attention due to its practical applications. Previous attempts inject information from the source image into each denoising step for an iterative refinement, thus resulting in a time-consuming implementation. We propose an efficient method that equips a diffusion model with a lightweight translator, dubbed a Diffusion Model Translator (DMT), to accomplish I2I. Specifically, we first offer theoretical justification that in employing the pioneering DDPM work for the I2I task, it is both feasible and sufficient to transfer the distribution from one domain to another only at some intermediate step. We further observe that the translation performance highly depends on the chosen timestep for domain transfer, and therefore propose a practical strategy to automatically select an appropriate timestep for a given task. We evaluate our approach on a range of I2I applications, including image stylization, image colorization, segmentation to image, and sketch to image, to validate its efficacy and general utility. The comparisons show that our DMT surpasses existing methods in both quality and efficiency. Code will be made publicly available.
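The single-timestep idea described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the linear noise schedule follows the common DDPM default, and `translator` and `denoise_step` are hypothetical callables standing in for the learned translation module and one reverse step of a pre-trained DDPM.

```python
import numpy as np

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for a linear beta schedule (assumed here)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I); t is 1-indexed."""
    a = alpha_bar[t - 1]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise

def dmt_translate(x0, t, alpha_bar, translator, denoise_step, rng):
    """Noise the source image to step t, translate once, then denoise t -> 0.

    `translator(x_t, t)` and `denoise_step(y, s)` are hypothetical placeholders.
    """
    x_t = forward_diffuse(x0, t, alpha_bar, rng)  # source domain at step t
    y = translator(x_t, t)                        # single domain transfer x_t -> y_t
    for s in range(t, 0, -1):                     # reuse the pre-trained reverse process
        y = denoise_step(y, s)
    return y
```

Because the translator is called once and the reverse loop starts at $t$ rather than $T$, the number of denoiser evaluations in this sketch is $t$ instead of $T$.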

Paper Structure

This paper contains 21 sections, 12 theorems, 39 equations, 12 figures, 11 tables, 4 algorithms.

Key Result

Lemma 1

The negative log-likelihood $-\log p_{\theta}(y_0|x_0)$ has the following upper bound, where $q=q(y_{1:t},x_{1:t}|y_0, x_0)$.

Figures (12)

  • Figure 1: Conceptual comparison between (a) existing methods [saharia2021palette, choi2021ilvr, liu2021more] and (b) our DMT. $\{x_t\}_{t=0}^T$ represent different states of the input from the source domain, while $y_T \to y_0$ stands for the denoising process of DDPM. Here, $T$ denotes the total number of noise-adding steps in the diffusion process. Instead of using the information $f_t(x)$ from the source domain (which can be the original or noisy image) for iterative refinement at each denoising step $t = 0, 1, \cdots, T$, DMT accomplishes the I2I task efficiently by learning a translation module at just one preset timestep and fully reusing the pre-trained DDPM. How to select an appropriate translation timestep is discussed in \ref{subsec:Optimal}.
  • Figure 2: Qualitative results of our proposed DMT on four I2I tasks: image stylization, image colorization, segmentation to image, and sketch to image. Here we equip a pre-trained DDPM with an efficient translation module. Our approach makes adequate use of the content information from the input condition as well as the domain knowledge contained in the learned denoising process.
  • Figure 3: Analysis of the preset timestep $t$. Our DMT needs a pre-defined timestep at which to learn and perform the distribution shift. We plot the distances $d(x_t, y_t)$ and $d(x_0, x_t)$ at different timesteps as the red and blue curves, respectively. As $t$ increases, $d(x_t, y_t)$ decreases, so the distribution is easier to shift from $x_t$ to $y_t$, while $d(x_0, x_t)$ increases, so the input condition signal becomes less relevant as $x_t$ drifts away from the input $x_0$. Given this trade-off, we select the intersection of the two curves as the practical choice of timestep for DMT learning.
  • Figure 4: Conceptual comparison of (a) multi-step DMT and (b) asymmetric DMT. $\{x_t\}_{t=0}^T$ represent different states of the input from the source domain, while $y_T \to y_0$ stands for the denoising process of DDPM. Here, $T$ denotes the total number of noise-adding steps in the diffusion process. Multi-step DMT combines the translation results of DMT at two different timesteps with an auxiliary fusion UNet and then denoises to produce the final output, while asymmetric DMT applies the translation across a pair of different timesteps $(s,t)$. Further discussion appears in \ref{subsec:generalization} and the Supplementary Material.
  • Figure 5: Qualitative comparison between DMT and SPADE [park2019SPADE] on the segmentation-to-image task. Our proposed DMT achieves better image quality and content consistency than SPADE.
  • ...and 7 more figures
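
The timestep-selection rule in the Figure 3 caption (pick the intersection of the two distance curves) can be illustrated with a short sketch. It assumes the two curves have already been measured on a common scale and stored as arrays indexed by timestep; the distance measure itself, and the function name `select_timestep`, are not taken from the paper and are placeholders.

```python
import numpy as np

def select_timestep(d_xy, d_x0):
    """Return the first 1-indexed timestep where d(x_0, x_t) reaches d(x_t, y_t).

    d_xy[i] holds d(x_t, y_t) and d_x0[i] holds d(x_0, x_t) for t = i + 1.
    Both arrays are assumed to be precomputed and comparable in scale.
    """
    d_xy = np.asarray(d_xy, dtype=float)
    d_x0 = np.asarray(d_x0, dtype=float)
    if d_xy.ndim != 1 or d_xy.shape != d_x0.shape:
        raise ValueError("expected two 1-D curves of equal length")
    crossed = np.nonzero(d_x0 - d_xy >= 0)[0]  # indices at or past the crossing
    if crossed.size == 0:
        raise ValueError("curves do not intersect on the sampled timesteps")
    return int(crossed[0]) + 1
```

For example, with a decreasing curve `10 - t` and an increasing curve `t` sampled at `t = 1, ..., 10`, the function returns `5`, the step at which the two curves meet.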

Theorems & Definitions (18)

  • Lemma 1
  • Theorem 1: Closed-form expression
  • Theorem 2: Optimal solution to \ref{eq:3.4}
  • Lemma 2
  • Theorem 3: Closed-form expression
  • Theorem 4: Optimal solution to \ref{eq:3.9}
  • Lemma 1
  • proof
  • Theorem 1: Closed-form expression
  • proof
  • ...and 8 more