Watch Your Steps: Local Image and Scene Editing by Text Instructions
Ashkan Mirzaei, Tristan Aumentado-Armstrong, Marcus A. Brubaker, Jonathan Kelly, Alex Levinshtein, Konstantinos G. Derpanis, Igor Gilitschenski
TL;DR
This work tackles the problem of localizing text-guided edits in both 2D images and 3D scenes represented as NeRFs. It introduces relevance maps, derived from the discrepancy between conditional and unconditional IP2P predictions at a fixed noise level, and thresholds them into a mask that constrains edits to the most relevant regions. It extends this locality to 3D with a learnable relevance field that guides view-wise NeRF updates, so that edits stay consistent across views. The authors report state-of-the-art results on image and NeRF editing tasks, with improved locality, fidelity, and cross-view consistency, and the relevance maps themselves offer an interpretable view of where an edit will land. The method is useful for precise, minimally invasive edits of images and 3D scenes; its trade-offs are governed by the relevance threshold and by the underlying edit model.
Abstract
Denoising diffusion models have enabled high-quality image generation and editing. We present a method to localize the desired edit region implicit in a text instruction. We leverage InstructPix2Pix (IP2P) and identify the discrepancy between IP2P predictions with and without the instruction. This discrepancy is referred to as the relevance map. The relevance map conveys the importance of changing each pixel to achieve the edits, and is used to guide the modifications. This guidance ensures that the irrelevant pixels remain unchanged. Relevance maps are further used to enhance the quality of text-guided editing of 3D scenes in the form of neural radiance fields. A field, denoted as the relevance field, is trained on relevance maps of the training views and defines the 3D region within which modifications should be made. We perform iterative updates on the training views guided by rendered relevance maps from the relevance field. Our method achieves state-of-the-art performance on both image and NeRF editing tasks. Project page: https://ashmrz.github.io/WatchYourSteps/
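The relevance-map idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the two IP2P noise predictions (with and without the instruction) are already available as arrays of shape (C, H, W), and the function names, the channel-mean absolute difference, the quantile clipping, the threshold value, and the final pixel-space blend are all illustrative assumptions rather than details given in the abstract.

```python
import numpy as np

def relevance_map(eps_cond, eps_uncond, clip_quantile=0.99):
    """Per-pixel relevance in [0, 1] from two noise predictions of shape (C, H, W).

    Assumption: the discrepancy is measured as the channel-mean absolute
    difference, clipped at a high quantile and min-max normalized.
    """
    diff = np.abs(eps_cond - eps_uncond).mean(axis=0)   # (H, W)
    upper = np.quantile(diff, clip_quantile)            # suppress outliers
    diff = np.clip(diff, 0.0, upper)
    span = diff.max() - diff.min()
    return (diff - diff.min()) / (span + 1e-8)

def edit_mask(relevance, threshold=0.5):
    """Binary mask of pixels considered relevant to the instruction.

    The threshold value is a hypothetical default.
    """
    return relevance >= threshold

def apply_masked_edit(original, edited, mask):
    """Keep `original` outside the mask and `edited` inside it.

    Simplification: blending is done on final arrays of shape (C, H, W);
    a diffusion-based editor could instead apply the mask during denoising.
    """
    return np.where(mask[None, ...], edited, original)
```

With synthetic inputs in which the two predictions differ only inside a small patch, the mask selects exactly that patch and every pixel outside it is copied from the original unchanged, which is the locality property the abstract describes. Lowering the threshold enlarges the editable region; raising it makes the edit more conservative.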
