There and Back Again: On the relation between Noise and Image Inversions in Diffusion Models

Łukasz Staniszewski, Łukasz Kuciński, Kamil Deja

TL;DR

This work analyzes DDIM inversion in diffusion models to reveal that early inversion steps produce biased, less diverse noise predictions in plain image regions, causing latents to deviate from Gaussian statistics and exhibit correlations. The authors show that these divergences reduce the manipulability of latent encodings for editing and interpolation. They propose a simple fix: replace the first few inversion steps with forward diffusion, which decorrelates latents and improves editing, interpolation quality, and stochastic editing of real images with minimal reconstruction cost. The approach is validated across multiple models and tasks, offering a practical method to enhance controllability of diffusion-based image editing and interpolation, while providing open-source code.

Abstract

Diffusion models achieve state-of-the-art performance in generating new samples but lack a low-dimensional latent space that encodes the data into editable features. Inversion-based methods address this by reversing the denoising trajectory, mapping images to an approximation of their starting noise. In this work, we thoroughly analyze this procedure and focus on the relation between the initial noise, the generated samples, and their corresponding latent encodings obtained through DDIM inversion. First, we show that latents exhibit structural patterns in the form of less diverse noise predicted for smooth image areas (e.g., plain sky). Through a series of analyses, we trace this issue to the first inversion steps, which fail to provide accurate and diverse noise. Consequently, the DDIM inversion space is notably less manipulable than the original noise. We show that prior inversion methods do not fully resolve this issue, but our simple fix, where we replace the first DDIM inversion steps with a forward diffusion process, successfully decorrelates latent encodings and enables higher-quality edits and interpolations. The code is available at https://github.com/luk-st/taba.
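
The fix described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that); it only assumes the standard DDIM inversion update and the closed-form forward diffusion. The names `make_alpha_bars`, `ddim_inversion_step`, `invert`, `eps_model`, and `forward_fraction`, as well as the linear beta schedule, are hypothetical choices made for illustration.

```python
import numpy as np

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Toy linear schedule (an assumption). Index 0 is the clean image (alpha_bar = 1)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def ddim_inversion_step(x, eps, abar_cur, abar_next):
    """One deterministic DDIM inversion step from noise level abar_cur to abar_next."""
    # Estimate the clean image implied by the current sample and predicted noise.
    x0_pred = (x - np.sqrt(1.0 - abar_cur) * eps) / np.sqrt(abar_cur)
    # Re-noise that estimate to the next (noisier) level with the same predicted noise.
    return np.sqrt(abar_next) * x0_pred + np.sqrt(1.0 - abar_next) * eps

def invert(x0, eps_model, alpha_bars, forward_fraction=0.0, rng=None):
    """Invert x0 to a latent. The first `forward_fraction` of the steps is replaced
    by forward diffusion (sampled Gaussian noise) instead of DDIM inversion.

    eps_model(x, t) is a hypothetical noise predictor returning an array shaped like x.
    """
    num_steps = len(alpha_bars) - 1
    k = int(round(forward_fraction * num_steps))  # number of replaced steps
    x = np.asarray(x0, dtype=np.float64)
    if k > 0:
        rng = np.random.default_rng() if rng is None else rng
        noise = rng.standard_normal(x.shape)
        # Closed-form forward diffusion: jump directly from step 0 to step k.
        x = np.sqrt(alpha_bars[k]) * x + np.sqrt(1.0 - alpha_bars[k]) * noise
    # Remaining steps use ordinary DDIM inversion with the model's noise prediction.
    for t in range(k, num_steps):
        eps = eps_model(x, t)
        x = ddim_inversion_step(x, eps, alpha_bars[t], alpha_bars[t + 1])
    return x
```

Calling `invert(x0, eps_model, alpha_bars, forward_fraction=0.04)` would correspond to the 4% setting mentioned in the Figure 4 caption. Because the replaced steps inject sampled noise, the resulting latent is no longer a deterministic function of the image, which is consistent with the small reconstruction cost the TL;DR mentions.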

Paper Structure

This paper contains 47 sections, 15 equations, 30 figures, 13 tables, 1 algorithm.

Figures (30)

  • Figure 1: DDIM inversion produces latent encodings that exhibit less diverse noise in smooth image areas than in non-plain ones. We attribute this problem to noise prediction errors in the first inversion steps.
  • Figure 2: Mean of top-20 Pearson correlation coefficients inside $8\times8$ patches for random Gaussian noises, latent encodings, and generations. DDIM Latents are much more correlated than noises.
  • Figure 3: LDM
  • Figure 4: Structures can be removed from DDIM latents by replacing inversion steps with forward diffusion. Using forward diffusion instead of the first $4\%$ of inversion steps brings the resulting latents closer to Gaussian noise without a major degradation in the image reconstruction.
  • Figure 5: Latent encodings exhibit image patterns. For small pixel-space models (a), we observe correlations directly in the inversion results. For larger models (e.g., LDMs), the same patterns can be observed in the absolute errors between the latent and the noise (b). This observation also holds for LDMs operating on $4$ channels, where we use PCA for visualization (c).
  • ...and 25 more figures
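
The Figure 2 caption refers to a mean of top-20 Pearson correlation coefficients inside $8\times8$ patches. The caption does not give the exact procedure, so the sketch below is one plausible reading and not necessarily the authors' metric: each of the 64 positions inside a patch is treated as a variable, all non-overlapping patches across all images are treated as observations, and the 20 largest absolute off-diagonal correlations are averaged. The function name `top_k_patch_correlation` and the use of absolute values are assumptions.

```python
import numpy as np

def top_k_patch_correlation(images, patch=8, k=20):
    """Mean of the k largest absolute Pearson correlations between pixel
    positions inside non-overlapping patch x patch windows.

    images: array of shape (N, H, W), single channel.
    This is an assumed reading of the metric, not a confirmed definition.
    """
    images = np.asarray(images, dtype=np.float64)
    n, h, w = images.shape
    # Crop so that height and width are multiples of the patch size.
    h_c, w_c = h - h % patch, w - w % patch
    x = images[:, :h_c, :w_c]
    # Split into patches and flatten each one: (num_patches, patch * patch).
    patches = (
        x.reshape(n, h_c // patch, patch, w_c // patch, patch)
        .transpose(0, 1, 3, 2, 4)
        .reshape(-1, patch * patch)
    )
    # Correlation between pixel positions (columns), observed across patches (rows).
    corr = np.corrcoef(patches, rowvar=False)
    # Keep each unordered pair once and skip the diagonal of ones.
    upper = np.abs(corr[np.triu_indices(patch * patch, k=1)])
    return float(np.sort(upper)[-k:].mean())
```

Under this reading, i.i.d. Gaussian noise should score close to zero (up to sampling error), while inputs with spatial structure inside a patch should score noticeably higher, which matches the qualitative claim in the caption that DDIM latents are much more correlated than noise.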