Noise Map Guidance: Inversion with Spatial Context for Real Image Editing

Hansam Cho, Jonghyun Lee, Seoung Bum Kim, Tae-Hyun Oh, Yonghyun Jeong

TL;DR

Noise Map Guidance (NMG) addresses real-image editing with text-guided diffusion models through an optimization-free inversion method that draws spatial context from DDIM noise maps. NMG conditions the reverse process on both noise maps and text embeddings, using energy guidance and gradient scaling, so that the spatial structure of the input is preserved while edits remain faithful to the prompt. It integrates with several editing techniques (e.g., Prompt-to-Prompt, MasaCtrl, pix2pix-zero), is robust across DDIM inversion variants, and reconstructs faster than Null-text Inversion (NTI) without compromising quality. Reported results include improved local and global edits, strong quantitative metrics, and favorable human judgments.

Abstract

Text-guided diffusion models have become a popular tool in image synthesis, known for producing high-quality and diverse images. However, applying them to real-image editing is often hindered because the text condition degrades reconstruction quality, which in turn lowers editing fidelity. Null-text Inversion (NTI) has made strides in this area, but it fails to capture spatial context and requires computationally intensive per-timestep optimization. Addressing these challenges, we present Noise Map Guidance (NMG), an inversion method rich in spatial context and tailored for real-image editing. Notably, NMG achieves this without requiring optimization while preserving editing quality. Our empirical investigations highlight NMG's adaptability across various editing techniques and its robustness to variants of DDIM inversion.
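To make the role of the text condition concrete, the sketch below shows the standard classifier-free guidance extrapolation, in which a scale above 1 pushes the noise prediction past the conditional estimate. This is the generic formula, not an equation taken from this summary; the function name `guided_noise`, the array shapes, and the random stand-in predictions are assumptions for illustration only.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, scale):
    """Classifier-free-guidance-style extrapolation.

    scale = 0 returns the unconditional prediction, scale = 1 the
    conditional one, and scale > 1 extrapolates beyond it.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Random arrays stand in for a real model's noise predictions (assumption).
rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((4, 8, 8))  # stand-in for eps(x_t, null text)
eps_text = rng.standard_normal((4, 8, 8))    # stand-in for eps(x_t, text)

amplified = guided_noise(eps_uncond, eps_text, 7.5)

# The guided prediction sits 7.5x further from the unconditional one
# than the plain conditional prediction does.
gap_plain = np.linalg.norm(eps_text - eps_uncond)
gap_guided = np.linalg.norm(amplified - eps_uncond)
print(gap_guided / gap_plain)
```

Because inversion is typically run without such amplification, a reverse pass that uses a large scale no longer retraces the inversion path. That mismatch is the reconstruction problem the abstract refers to.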

Paper Structure

This paper contains 38 sections, 12 equations, 12 figures, and 3 tables.

Figures (12)

  • Figure 1: Compared to other inversion methods, NMG (a) demonstrates high-fidelity editing when paired with Prompt-to-Prompt, (b) successfully conducts viewpoint alteration via MasaCtrl, and (c) preserves the spatial context of the input image while performing zero-shot image-to-image translation with pix2pix-zero. The text prompt for each input image is shown beneath each sample, with the words introduced for editing highlighted in green.
  • Figure 2: As seen in (a), naive reconstruction often fails because the reconstruction path diverges from the original inversion path. Reliable reconstruction requires realigning the reconstruction path with the inversion path. As depicted in (b), NTI achieves this alignment by optimizing the null-text embedding, thereby reducing the error between the inversion and reconstruction paths. In contrast, NMG, as shown in (c), conditions the reconstruction process on the divergence between the two paths and uses this difference to refine the reconstruction path.
  • Figure 3: Image editing results using Prompt-to-Prompt are shown in (a) for local editing and (b) for global editing. The results show that DDIM fails to preserve details of the input image, both NTI and NPI struggle to maintain spatial context, and ProxNPI exhibits limited editing capabilities. In contrast, NMG consistently produces robust results for both local and global edits.
  • Figure 4: Image editing outcomes are presented using (a) MasaCtrl and (b) pix2pix-zero. NMG's proficiency in retaining spatial context is highlighted in (a), while its resilience to variations of DDIM inversion is showcased in (b).
  • Figure 5: Ablation results of (a) guidance scales and (b) gradient scales. In (a), we demonstrate that the noise map guidance scale governs the influence of input image nuances, while the text guidance scale steers the extent of edits in the desired direction. In (b), we demonstrate that the gradient scale regulates the degree of alignment with the inversion trajectory.
  • ...and 7 more figures
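
The path-alignment idea in the captions of Figure 2 and Figure 5(b) can be illustrated with a toy example. The sketch below is not the NMG algorithm: it replaces the diffusion model with a hypothetical one-dimensional linear map, adds a constant `drift` to mimic the mismatch that makes naive reconstruction leave the inversion path, and then nudges each latent along the negative gradient of its distance to the stored inversion latent. All function names and constants are assumptions chosen for illustration.

```python
import numpy as np

def inversion_step(x, a=1.1, b=0.05):
    # Toy "inversion" map taking a latent one step toward noise.
    return a * x + b

def naive_reverse_step(x, a=1.1, b=0.05, drift=0.03):
    # Exact inverse of inversion_step plus a constant drift, standing in
    # for the mismatch that pulls reconstruction off the inversion path.
    return (x - b) / a + drift

def guided_reverse_step(x, target, grad_scale, a=1.1, b=0.05, drift=0.03):
    # Gradient of 0.5 * ||step(x) - target||^2 with respect to x is
    # residual / a for this toy map, since d step / d x = 1 / a.
    residual = naive_reverse_step(x, a, b, drift) - target
    x_adjusted = x - grad_scale * residual / a
    return naive_reverse_step(x_adjusted, a, b, drift)

def reconstruction_error(grad_scale, steps=10):
    # Run the toy inversion, storing every latent along the path.
    x0 = np.linspace(-1.0, 1.0, 16)
    path = [x0]
    for _ in range(steps):
        path.append(inversion_step(path[-1]))
    # Walk back, steering each step toward the stored inversion latent.
    x = path[-1]
    for t in range(steps, 0, -1):
        x = guided_reverse_step(x, path[t - 1], grad_scale)
    return float(np.abs(x - x0).max())

for scale in (0.0, 0.5, 1.0):
    print(scale, reconstruction_error(scale))
```

With `grad_scale = 0` the loop reduces to naive reconstruction and the drift accumulates. Larger values in this range pull each step closer to the stored path, which mirrors the caption's statement that the gradient scale regulates the degree of alignment with the inversion trajectory.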