Watch Your Steps: Local Image and Scene Editing by Text Instructions

Ashkan Mirzaei, Tristan Aumentado-Armstrong, Marcus A. Brubaker, Jonathan Kelly, Alex Levinshtein, Konstantinos G. Derpanis, Igor Gilitschenski

TL;DR

This work tackles the problem of localizing text-guided edits in both 2D images and 3D scenes edited via NeRFs. It introduces relevance maps derived from the discrepancy between conditional and unconditional IP2P predictions at a fixed noise level, and uses a mask to constrain edits to the most relevant regions; it further extends locality to 3D with a learnable relevance field that guides view-wise NeRF updates for consistent edits. The approach yields state-of-the-art results on image and NeRF editing tasks, improving locality, fidelity, and cross-view consistency, while providing interpretable guidance through relevance maps. The method has practical impact for precise, minimally invasive edits in multimedia content and 3D scenes, with clear trade-offs governed by the relevance threshold and the underlying edit model.

Abstract

Denoising diffusion models have enabled high-quality image generation and editing. We present a method to localize the desired edit region implicit in a text instruction. We leverage InstructPix2Pix (IP2P) and identify the discrepancy between IP2P predictions with and without the instruction. This discrepancy is referred to as the relevance map. The relevance map conveys the importance of changing each pixel to achieve the edits, and is used to guide the modifications. This guidance ensures that the irrelevant pixels remain unchanged. Relevance maps are further used to enhance the quality of text-guided editing of 3D scenes in the form of neural radiance fields. A field is trained on relevance maps of training views, denoted as the relevance field, defining the 3D region within which modifications should be made. We perform iterative updates on the training views guided by rendered relevance maps from the relevance field. Our method achieves state-of-the-art performance on both image and NeRF editing tasks. Project page: https://ashmrz.github.io/WatchYourSteps/
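The relevance map described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the two IP2P noise predictions (with and without the instruction) are already available as NumPy arrays of shape (H, W, C), and the names `relevance_map`, `edit_mask`, `clip_quantile`, and `threshold`, as well as the outlier clamping and min-max normalization, are hypothetical choices made for this example.

```python
import numpy as np


def relevance_map(eps_cond, eps_uncond, clip_quantile=0.99):
    """Per-pixel disagreement between two noise predictions, scaled to [0, 1].

    eps_cond / eps_uncond: arrays of shape (H, W, C) holding the noise
    predicted with and without the text instruction (supplied by the caller).
    """
    # Absolute difference, averaged over the channel axis.
    diff = np.abs(eps_cond - eps_uncond).mean(axis=-1)
    # Assumption: clamp the largest values so a few outliers do not
    # dominate the normalization.
    upper = np.quantile(diff, clip_quantile)
    diff = np.clip(diff, 0.0, upper)
    # Min-max normalize; an all-constant map carries no signal.
    low = diff.min()
    span = diff.max() - low
    if span == 0:
        return np.zeros_like(diff)
    return (diff - low) / span


def edit_mask(relevance, threshold=0.5):
    """Binarize the relevance map; True marks pixels that may be edited."""
    return relevance >= threshold
```

Lowering `threshold` widens the editable region and raising it narrows the region, which is the trade-off the TL;DR attributes to the relevance threshold.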

Paper Structure (16 sections, 5 equations, 16 figures, 1 table)

Figures (16)

  • Figure 1: Overview of the calculation of the relevance map (left inset), and sample outputs on image (top-right inset) and neural radiance field (bottom-right inset) editing guided by the relevance. Given an image or a Neural Radiance Field (NeRF), our goal is to change the input according to a textual instruction. The relevance map is the disagreement between noise predictions with and without the instruction. For both image and scene editing, we use the relevance map to confine the changes to the most relevant region, according to the edit text.
  • Figure 2: Overview of a denoising step for image editing via relevance-guidance. The relevance map is binarized to get the edit mask. After denoising the output of the last stage with IP2P, the unmasked pixels are swapped with the noisy input to ensure consistency with the input throughout the process.
  • Figure 3: Overview of our relevance-guided NeRF editing method. Iteratively, we take a random view and render it using both the main NeRF and the relevance field. The rendered image is edited, guided by the rendered relevance, to only change pixels that are highly relevant to the task. IP2P is used as the backbone of the editing method, and is always conditioned on the initial captures from the scene. This is to prevent drastic drifts from the original scene in the recurrent synthesis process [in2n]. The relevance-guided image editing module (\ref{sec:relevance.guided.image.editing}) returns an edited image and an updated relevance, which are used to update the corresponding training views for the NeRF and the relevance field, respectively.
  • Figure 4: Quantitative image editing evaluation. Our model achieves better text-image direction similarity (x-axis), while maintaining higher fidelity to the input (y-axis). The text guidance scale is set to $7.5$ for every method. We vary SDEdit's strength over $[0.1, 0.9]$ and DiffEdit's encoding ratio over $[0.5, 0.9]$. For IP2P, $s_I$ is varied over $[1, 2.2]$. For our method, $s_I$ is set to $1$.
  • Figure 5: Our image editing results compared to IP2P. For both models, $s_T$ and $s_I$ are set to $7.5$ and $1$, respectively. IP2P fails to isolate the specified region, and over-edits the input. Our model explicitly predicts the scope of the edit, and limits the edit inside a specific region.
  • ...and 11 more figures
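The denoising step summarized in the Figure 2 caption (denoise, then swap pixels outside the edit mask back to the noisy input) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: `noise_to_level` uses the standard DDPM forward-noising formula, the IP2P denoiser itself is not modeled, and the function names and the `alpha_bar` parameter are hypothetical.

```python
import numpy as np


def noise_to_level(x0, alpha_bar, rng):
    """Noise a clean input to a given level (standard DDPM forward process).

    alpha_bar is the cumulative signal coefficient in [0, 1] for the
    current step; 1.0 means no noise is added.
    """
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise


def masked_swap(denoised, noisy_input, mask):
    """Keep the denoiser output inside the mask, the noisy input elsewhere.

    denoised, noisy_input: arrays of shape (H, W, C).
    mask: boolean array of shape (H, W); True marks editable pixels.
    """
    # Broadcast the 2D mask over the channel axis.
    return np.where(mask[..., None], denoised, noisy_input)
```

In a full loop, each step would call the editing model to produce `denoised`, noise the original input to the matching level with `noise_to_level`, and then apply `masked_swap`, so pixels outside the mask track the input throughout the process.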