An Edit Friendly DDPM Noise Space: Inversion and Manipulations

Inbar Huberman-Spiegelglas, Vladimir Kulikov, Tomer Michaeli

TL;DR

The paper tackles editing real images with diffusion models by introducing an edit-friendly DDPM noise space and a fast inversion that yields a sequence of latent noise maps capable of perfect reconstruction. By constructing an auxiliary diffusion path with i.i.d. perturbations, the method produces noise maps that imprint image structure more strongly and exhibit higher variance, enabling structure-preserving edits when the maps are fixed and the condition is changed. The approach supports diverse text-guided edits and can be integrated with existing diffusion-based editing techniques (e.g., P2P, Zero-Shot I2I, DDIM-based methods) to improve fidelity and variety without slow optimization or fine-tuning. Experiments on modified ImageNet-R-TI2I and Zero-Shot I2I datasets show favorable LPIPS/CLIP trade-offs, shorter edit times, and better texture preservation.

Abstract

Denoising diffusion probabilistic models (DDPMs) employ a sequence of white Gaussian noise samples to generate an image. In analogy with GANs, those noise maps could be considered as the latent code associated with the generated image. However, this native noise space does not possess a convenient structure, and is thus challenging to work with in editing tasks. Here, we propose an alternative latent noise space for DDPM that enables a wide range of editing operations via simple means, and present an inversion method for extracting these edit-friendly noise maps for any given image (real or synthetically generated). As opposed to the native DDPM noise space, the edit-friendly noise maps do not have a standard normal distribution and are not statistically independent across timesteps. However, they allow perfect reconstruction of any desired image, and simple transformations on them translate into meaningful manipulations of the output image (e.g., shifting, color edits). Moreover, in text-conditional models, fixing those noise maps while changing the text prompt modifies semantics while retaining structure. We illustrate how this property enables text-based editing of real images via the diverse DDPM sampling scheme (in contrast to the popular non-diverse DDIM inversion). We also show how it can be used within existing diffusion-based editing methods to improve their quality and diversity. Webpage: https://inbarhub.github.io/DDPM_inversion
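
The inversion idea summarized above can be illustrated with a minimal NumPy sketch. It is a toy under stated assumptions, not the authors' implementation: `toy_noise_predictor` is a hypothetical stand-in for a trained network, the scalar `cond` stands in for a text prompt, the linear beta schedule is arbitrary, and the reverse step uses the common DDPM choice $\sigma_t^2 = \beta_t$ for simplicity. Each $x_t$ on the auxiliary path is built from $x_0$ with its own independent noise sample, and $z_t$ is then read off as the residual between $x_{t-1}$ and the predicted reverse-step mean.

```python
import numpy as np


def make_schedule(T=50, beta_start=1e-4, beta_end=2e-2):
    """Linear beta schedule (an assumption); index t-1 holds timestep t."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars


def toy_noise_predictor(x_t, t, cond):
    """Hypothetical stand-in for a trained noise-prediction network."""
    return 0.5 * np.tanh(x_t + cond)


def reverse_mean(x_t, t, cond, betas, alpha_bars):
    """Mean of the reverse step x_t -> x_{t-1} (standard DDPM form)."""
    beta, alpha_bar = betas[t - 1], alpha_bars[t - 1]
    eps = toy_noise_predictor(x_t, t, cond)
    return (x_t - beta / np.sqrt(1.0 - alpha_bar) * eps) / np.sqrt(1.0 - beta)


def edit_friendly_inversion(x0, cond, betas, alpha_bars, rng):
    """Return x_T and the noise maps {z_T, ..., z_1} for a given x0."""
    T = len(betas)
    # Auxiliary path: every x_t gets its own independent Gaussian sample.
    xs = [x0]
    for t in range(1, T + 1):
        noise = rng.standard_normal(x0.shape)
        a = alpha_bars[t - 1]
        xs.append(np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise)
    # z_t is the scaled residual between x_{t-1} and the predicted mean.
    zs = {}
    for t in range(T, 0, -1):
        mu = reverse_mean(xs[t], t, cond, betas, alpha_bars)
        zs[t] = (xs[t - 1] - mu) / np.sqrt(betas[t - 1])
    return xs[T], zs


def generate(x_T, zs, cond, betas, alpha_bars):
    """Run the reverse process with fixed noise maps and a given condition."""
    x = x_T
    for t in range(len(betas), 0, -1):
        mu = reverse_mean(x, t, cond, betas, alpha_bars)
        x = mu + np.sqrt(betas[t - 1]) * zs[t]
    return x


if __name__ == "__main__":
    betas, alpha_bars = make_schedule()
    rng = np.random.default_rng(0)
    x0 = rng.standard_normal((8, 8))          # stand-in for an image
    x_T, zs = edit_friendly_inversion(x0, 0.0, betas, alpha_bars, rng)
    recon = generate(x_T, zs, 0.0, betas, alpha_bars)   # same condition
    edited = generate(x_T, zs, 1.0, betas, alpha_bars)  # changed condition
    print("max reconstruction error:", np.abs(recon - x0).max())
    print("max change after edit:", np.abs(edited - x0).max())
```

Calling `generate` with the same `cond` used for inversion reproduces `x0` up to floating-point error, because each reverse step simply undoes the definition of `z_t`. Changing `cond` while keeping the noise maps fixed gives a different output; whether that output preserves structure depends on the trained model and is not something this toy can demonstrate.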

Paper Structure

This paper contains 26 sections, 7 equations, 22 figures, 4 tables, 1 algorithm.

Figures (22)

  • Figure 1: The native and edit friendly noise spaces. When sampling an image using DDPM (left), there is access to the "ground truth" noise maps that generated it. This native noise space, however, is not edit friendly (2nd column). For example, fixing those noise maps and changing the text prompt, changes the image structure (top). Similarly, flipping (middle) or shifting (bottom) the noise maps completely modifies the image. By contrast, our edit friendly noise maps enable editing while preserving structure (right).
  • Figure 2: The DDPM latent noise space. In DDPM, the generative (reverse) diffusion process synthesizes an image $x_0$ in $T$ steps, by utilizing $T+1$ noise maps, $\{x_T,z_T,\ldots,z_1\}$. We regard those noise maps as the latent code associated with the generated image.
  • Figure 3: DDPM inversion via CycleDiffusion vs. our method. CycleDiffusion's inversion [Wu22] extracts a sequence of noise maps $\{x_T,z_T,\ldots,z_1\}$ whose joint distribution is close to that used in regular sampling. However, fixing this latent code and replacing the text prompt fails to preserve the image structure. Our inversion deviates from the regular sampling distribution, but better encodes the image structure.
  • Figure 4: Regular vs. edit-friendly diffusion. In the regular generative process (top left), the noise vectors (red) are statistically independent across timesteps and thus the angle between consecutive vectors is uniformly distributed in $[0,180^\circ]$ (bottom left). In our dynamics (top right) the noise vectors have higher variances and are negatively correlated across consecutive times (bottom right).
  • Figure 5: Native vs. edit friendly noise statistics. Here we show the per-pixel standard deviations of $\{z_t\}$ and the per-pixel correlation between them for model-generated images.
  • ...and 17 more figures
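
The latent-code view used in the captions of Figures 2 and 3 can be written out as a short worked relation. The symbols $\hat{\mu}_t$ (the mean predicted by the denoiser at step $t$) and $\sigma_t$ (the per-step noise scale) are notation assumed here and are not defined in this excerpt. Each reverse step adds one noise map, so $x_0$ is a deterministic function of $\{x_T, z_T, \ldots, z_1\}$, and given any sequence $x_0, \ldots, x_T$ the same relation can be solved for the noise maps:

$$
x_{t-1} = \hat{\mu}_t(x_t) + \sigma_t z_t, \quad t = T, \ldots, 1
\qquad\Longrightarrow\qquad
z_t = \frac{x_{t-1} - \hat{\mu}_t(x_t)}{\sigma_t}.
$$

Substituting the extracted $z_t$ back into the reverse step returns $x_{t-1}$ exactly at every $t$, which is why such a latent code reconstructs the image regardless of how the sequence $x_1, \ldots, x_T$ was constructed.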