MirrorDiffusion: Stabilizing Diffusion Process in Zero-shot Image Translation by Prompts Redescription and Beyond

Yupei Lin, Xiaoyu Xian, Yukai Shi, Liang Lin

TL;DR

Zero-shot diffusion-based image translation often suffers from displacement during diffusion and inversion, causing structure drift between the input and the reconstructed output. MirrorDiffusion introduces a prompt redescription mechanism that aligns prompts with latent codes at each DDIM inversion step, creating a mirror relation between the source and reconstructed images. It optimizes a latent-prompt alignment objective $\mathcal{L}_{rewrite}$ and updates the rewritten prompt as $c_{rewrite} = c_t - \lambda \nabla_c \mathcal{L}_{rewrite}$. For translation, it adds a CLIP-based domain gap $\Delta c$ to the rewritten prompt and samples with $z'_{t-1} = \mathrm{Sample}(\epsilon_\theta, z_t, c_{rewrite} + \Delta c, t)$. Experiments on LAION-5B-derived tasks demonstrate superior translation quality and structure preservation over baselines, with improved stability, establishing a practical approach for reliable zero-shot diffusion-based translation with minimal supervision.
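The update rule above is a single gradient step on the prompt embedding. A minimal sketch in Python follows, assuming a toy quadratic objective in place of $\mathcal{L}_{rewrite}$, whose exact form is not given in this summary. The names `redescribe_prompt`, `target_embedding`, `toy_loss`, `toy_grad`, and `c_star` are hypothetical, and the placeholder `delta_c` is not computed from CLIP here.

```python
import numpy as np

def redescribe_prompt(c_t, grad_fn, lam):
    """One redescription step: c_rewrite = c_t - lam * grad_c L_rewrite(c_t)."""
    return c_t - lam * grad_fn(c_t)

def target_embedding(c_rewrite, delta_c):
    """Conditioning passed to the sampler: c_rewrite + delta_c."""
    return c_rewrite + delta_c

# Hypothetical stand-in objective (NOT the paper's L_rewrite):
# L(c) = 0.5 * ||c - c_star||^2, whose gradient is c - c_star.
c_star = np.array([1.0, -2.0, 0.5])

def toy_loss(c):
    return 0.5 * float(np.sum((c - c_star) ** 2))

def toy_grad(c):
    return c - c_star

c_t = np.zeros(3)                                   # current prompt embedding
c_rewrite = redescribe_prompt(c_t, toy_grad, lam=0.1)
delta_c = np.array([0.2, 0.0, -0.1])                # placeholder domain gap
c_target = target_embedding(c_rewrite, delta_c)
```

In a real pipeline the gradient would presumably come from automatic differentiation through the denoiser rather than from a closed form; the sketch only illustrates the arithmetic of the step and of the shifted conditioning.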

Abstract

Recently, text-to-image diffusion models have become a new paradigm in image processing fields, including content generation, image restoration, and image-to-image translation. Given a target prompt, Denoising Diffusion Probabilistic Models (DDPM) are able to generate realistic yet eligible images. With this appealing property, the image translation task has the potential to be free from target image samples for supervision. By using a target text prompt for domain adaptation, the diffusion model is able to implement zero-shot image-to-image translation advantageously. However, the sampling and inversion processes of DDPM are stochastic, and thus the inversion process often fails to reconstruct the input content. Specifically, the displacement effect gradually accumulates during the diffusion and inversion processes, which leads the reconstructed results to deviate from the source domain. To make reconstruction explicit, we propose a prompt redescription strategy to realize a mirror effect between the source and reconstructed image in the diffusion model (MirrorDiffusion). More specifically, a prompt redescription mechanism is investigated to align the text prompts with the latent code at each time step of the Denoising Diffusion Implicit Models (DDIM) inversion to pursue a structure-preserving reconstruction. With the revised DDIM inversion, MirrorDiffusion is able to realize accurate zero-shot image translation by editing optimized text prompts and latent code. Extensive experiments demonstrate that MirrorDiffusion achieves superior performance over the state-of-the-art methods on zero-shot image translation benchmarks by clear margins, with practical model stability.
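The accumulated displacement can be illustrated with a small round trip $z_0 \to z_T \to z'_0$. The sketch below uses the standard deterministic DDIM update with a toy noise predictor standing in for $\epsilon_\theta$; the functions `ddim_step` and `round_trip`, the schedule `alphas`, and both toy predictors are assumptions for illustration, not the paper's model. The round trip is exact when the predicted noise does not depend on the latent, and drifts once it does, because inversion evaluates the predictor at $z_t$ while sampling evaluates it at $z_{t+1}$.

```python
import numpy as np

def ddim_step(z, eps, a_from, a_to):
    """Deterministic DDIM update between cumulative-alpha levels a_from -> a_to."""
    z0_pred = (z - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * z0_pred + np.sqrt(1.0 - a_to) * eps

def round_trip(z0, eps_model, alphas):
    """Invert z_0 -> z_T, then sample back z_T -> z'_0 with the same predictor."""
    z = z0.copy()
    for t in range(len(alphas) - 1):              # inversion: noise level rises
        z = ddim_step(z, eps_model(z, t), alphas[t], alphas[t + 1])
    for t in range(len(alphas) - 1, 0, -1):       # sampling: noise level falls
        z = ddim_step(z, eps_model(z, t), alphas[t], alphas[t - 1])
    return z

alphas = np.linspace(0.999, 0.05, 20)             # toy schedule (assumption)
z0 = np.array([0.5, -1.0, 2.0])

# Predictor that ignores the latent: inversion and sampling cancel exactly.
const_eps = lambda z, t: np.full_like(z, 0.3)
err_const = float(np.max(np.abs(round_trip(z0, const_eps, alphas) - z0)))

# Latent-dependent predictor (toy stand-in for a network): the round trip drifts.
tanh_eps = lambda z, t: np.tanh(z)
err_tanh = float(np.max(np.abs(round_trip(z0, tanh_eps, alphas) - z0)))
```

Comparing `err_const` with `err_tanh` shows the effect in miniature: the first is at floating-point level, the second is not.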

Paper Structure

This paper contains 11 sections, 6 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Without any supervision, MirrorDiffusion realizes three zero-shot image-to-image translations: w/o glasses $\rightarrow$ w/ glasses, Male $\rightarrow$ Female and Fox $\rightarrow$ Dog.
  • Figure 2: To show the displacement effect, panel (a) visualizes the reconstruction process of a typical DDIM-based method [parmar2023zero], which can be formulated as $z_0\to z_T\to z'_0$. Errors accumulate in typical diffusion methods, causing biases between the latent codes $[z_0, z'_0]$ and deviations between $[I_{source}, I_{reco}]$. To align the latent codes, we propose a prompt redescription mechanism that realizes a mirror effect between the source and reconstructed images in the diffusion model (MirrorDiffusion).
  • Figure 3: Visualization results. Compared with state-of-the-art diffusion approaches across four tasks, our method excels in generating highly realistic translation results with excellent structure consistency.
  • Figure 4: Framework overview of MirrorDiffusion. With the prompt redescription mechanism, our model obtains firmly aligned pairs $[z_0, z'_0]$ and $[I_{source}, I_{reco}]$. We apply CLIP [radford2021learning] to compute the domain gap $\Delta c$ between the source domain and the target domain for image editing. Specifically, CLIP extracts high-level features of the source-domain sentences and the target-domain sentences, and the difference between the means of these features is taken as the domain gap $\Delta c$. We then apply the target text embedding $c_{rewrite} + \Delta c$ for zero-shot image translation with the diffusion inversion process. With $T$-step inversion, MirrorDiffusion obtains the corresponding latent code $z'_0$, which $Dec(\cdot)$ decodes into $I_{trans}$.
  • Figure 5: Attention maps of 'w/o $\mathcal{L}_{rewrite}$' and 'w/ $\mathcal{L}_{rewrite}$' during the reconstruction process.
  • ...and 4 more figures
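
The domain gap $\Delta c$ described in the Figure 4 caption is a difference of mean sentence features. A minimal sketch follows, assuming the features are already available as arrays of shape (number of sentences, feature dimension) and that the gap points from source to target, which the caption implies but does not state. The function `domain_gap` is a hypothetical name, and the random arrays are placeholders for CLIP text features, not real encoder outputs.

```python
import numpy as np

def domain_gap(source_feats, target_feats):
    """Delta c = mean(target sentence features) - mean(source sentence features)."""
    return target_feats.mean(axis=0) - source_feats.mean(axis=0)

# Placeholder features (assumption): 8 sentences per domain, 4-dim embeddings.
rng = np.random.default_rng(0)
source_feats = rng.normal(size=(8, 4))
shift = np.array([0.5, -0.25, 0.0, 1.0])      # synthetic source-to-target offset
target_feats = source_feats + shift

delta_c = domain_gap(source_feats, target_feats)
```

Averaging over many sentences per domain presumably reduces the influence of any single phrasing, so that $\Delta c$ captures the domain change rather than sentence-specific wording.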