Editable Noise Map Inversion: Encoding Target-image into Noise For High-Fidelity Image Manipulation

Mingyu Kang, Yong Suk Choi

TL;DR

This work addresses a known weakness of diffusion-based image editing: inverting a real image into noise maps tends to yield maps that reconstruct the source faithfully but are hard to edit. It introduces Editable Noise Map Inversion (ENM Inversion), which searches for noise maps that both preserve content and remain editable. It does so by minimizing the gap between the reconstructed and edited noise maps, using the terms $L_{edit}$ and $L_{prev}$ in the objective $L = L_{prev} + \lambda L_{edit}$ together with a denoising step threshold $\tau$. The method extends to video by applying the approach frame-by-frame within Video-P2P and enforcing temporal consistency via cross-frame attention control. Empirical results show that ENM Inversion outperforms existing inversion approaches on image and video editing tasks in both preservation and edit fidelity, while remaining competitive in efficiency. The approach targets high-fidelity, prompt-controlled visual edits and temporally coherent video manipulation in attention-guided diffusion pipelines.
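To make the combined objective $L = L_{prev} + \lambda L_{edit}$ concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The summary only says that $L_{edit}$ measures the difference between the reconstructed and edited noise maps, so that term is modeled here as a mean squared error. The form of $L_{prev}$ is not given, so treating it as a mean squared error between a predicted latent and a reference latent is an assumption. The function names, argument names, and the `apply_edit` flag (a stand-in for however $\tau$ gates the edit term) are all hypothetical.

```python
import numpy as np


def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.mean((a - b) ** 2))


def enm_objective(eps_recon, eps_edit, latent_pred, latent_ref,
                  lam=1.0, apply_edit=True):
    """Illustrative combined loss L = L_prev + lam * L_edit.

    eps_recon / eps_edit: noise maps from the reconstruction and editing
        paths (L_edit is their mean squared difference).
    latent_pred / latent_ref: ASSUMED inputs for L_prev; the summary does
        not define this term, so an MSE between latents is a placeholder.
    apply_edit: hypothetical switch standing in for the step threshold tau;
        how tau actually gates the edit term is not specified here.
    """
    l_prev = mse(latent_pred, latent_ref)
    l_edit = mse(eps_recon, eps_edit) if apply_edit else 0.0
    return l_prev + lam * l_edit


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shape = (4, 8, 8)  # toy latent shape, chosen only for the demo
    eps_r, eps_e = rng.normal(size=shape), rng.normal(size=shape)
    z_pred, z_ref = rng.normal(size=shape), rng.normal(size=shape)
    print(enm_objective(eps_r, eps_e, z_pred, z_ref, lam=0.5))
```

In this sketch, a larger `lam` puts more weight on aligning the two noise maps, while `lam = 0` reduces the loss to the preservation term alone.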

Abstract

Text-to-image diffusion models have achieved remarkable success in generating high-quality and diverse images. Building on these advancements, diffusion models have also demonstrated exceptional performance in text-guided image editing. A key strategy for effective image editing involves inverting the source image into editable noise maps associated with the target image. However, previous inversion methods face challenges in adhering closely to the target text prompt. The limitation arises because inverted noise maps, while enabling faithful reconstruction of the source image, restrict the flexibility needed for desired edits. To overcome this issue, we propose Editable Noise Map Inversion (ENM Inversion), a novel inversion technique that searches for optimal noise maps to ensure both content preservation and editability. We analyze the properties of noise maps for enhanced editability. Based on this analysis, our method introduces an editable noise refinement that aligns with the desired edits by minimizing the difference between the reconstructed and edited noise maps. Extensive experiments demonstrate that ENM Inversion outperforms existing approaches across a wide range of image editing tasks in both preservation and edit fidelity with target prompts. Our approach can also be easily applied to video editing, enabling temporal consistency and content manipulation across frames.

Paper Structure

This paper contains 19 sections, 6 equations, 11 figures, 7 tables, 1 algorithm.

Figures (11)

  • Figure 1: ENM Inversion for real image editing. PNP Inversion (PNPInv) preserves the structure and content of the original image but shows limited editing capability. Our method preserves the details of the source image and enables precise modification.
  • Figure 2: The pipeline of image editing. (a) DDIM Inversion transforms the input image into noise maps that allow reconstruction of the original image. Additionally, ENM Inversion minimizes the gap with ideal noise by applying editable noise refinement, enabling improved reconstruction and editability. (b) Utilizing attention control, the attention maps from the reconstruction path are transferred to the editing path. Our inversion enhances both editability and preservation, resulting in the desired output image.
  • Figure 3: Relationship Between Editing Performance and Noise Map Differences. Editing performance is evaluated using LPIPS, which measures perceptual similarity, and CLIP score, which assesses alignment with the target prompt. The size of each circle indicates the magnitude of differences between the reconstructed and edited noise maps at the 30th inversion step. Smaller noise map differences correlate with better editing performance.
  • Figure 4: Qualitative comparisons of various inversion methods using Prompt-to-Prompt (P2P) (Hertz et al., 2022). Other inversion techniques either produce a background or structure inconsistent with the source image or exhibit limited editing capability. Our approach not only retains high fidelity to the source image but also demonstrates superior editing capabilities.
  • Figure 5: Qualitative comparison of our inversion using Video-P2P (Liu et al., 2024) for video editing. Our method demonstrates superior temporal consistency, content maintenance, and editing quality when modifying backgrounds or objects.
  • ...and 6 more figures