Diffusion-Based Attention Warping for Consistent 3D Scene Editing

Eyal Gomel, Lior Wolf

TL;DR

A diffusion-based method for 3D scene editing that warps attention features from a single edited reference image into other views, so that edits stay consistent and realistic across perspectives.

Abstract

We present a novel method for 3D scene editing using diffusion models, designed to ensure view consistency and realism across perspectives. Our approach leverages attention features extracted from a single reference image to define the intended edits. These features are warped to other views by aligning them with scene geometry derived from Gaussian splatting depth estimates. Injecting the warped features into the diffusion process of those views propagates the edits coherently, achieving high fidelity and spatial alignment in 3D space. Extensive evaluations demonstrate the effectiveness of our method in generating versatile edits of 3D scenes, significantly advancing the capabilities of scene manipulation compared to existing methods. Project page: https://attention-warp.github.io
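The warping and injection steps described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names `warp_features` and `blend_features`, the shared pinhole intrinsics `K`, the nearest-neighbour sampling, the absence of occlusion handling, and the fixed blend weight `alpha` are all assumptions made for illustration. The sketch uses the depth map of the target view to look up, for every target pixel, the corresponding location in the source view, and then mixes the fetched source features with the target view's own features.

```python
import numpy as np


def warp_features(src_feat, tgt_depth, K, T_src_from_tgt):
    """Backward-warp a source feature map into a target view (illustrative only).

    src_feat:       (H, W, C) features saved from the edited source view.
    tgt_depth:      (H, W) depth map of the target view.
    K:              (3, 3) pinhole intrinsics, assumed shared by both views.
    T_src_from_tgt: (4, 4) rigid transform from target-camera to
                    source-camera coordinates.
    Returns the warped features (H, W, C) and a boolean validity mask (H, W).
    """
    H, W, C = src_feat.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Unproject every target pixel to a 3D point using the target depth.
    pts_tgt = (pix @ np.linalg.inv(K).T) * tgt_depth.reshape(-1, 1)

    # Move the points into the source camera frame and project them.
    pts_src = pts_tgt @ T_src_from_tgt[:3, :3].T + T_src_from_tgt[:3, 3]
    z = pts_src[:, 2]
    proj = pts_src @ K.T
    safe_z = np.where(z > 1e-8, z, 1.0)  # avoid dividing by zero or negative depth
    us = np.rint(proj[:, 0] / safe_z).astype(int)
    vs = np.rint(proj[:, 1] / safe_z).astype(int)

    # Keep only points in front of the source camera that land inside the image.
    valid = (z > 1e-8) & (us >= 0) & (us < W) & (vs >= 0) & (vs < H)

    # Nearest-neighbour lookup; pixels without a valid match stay zero.
    warped = np.zeros((H * W, C), dtype=src_feat.dtype)
    warped[valid] = src_feat[vs[valid], us[valid]]
    return warped.reshape(H, W, C), valid.reshape(H, W)


def blend_features(tgt_feat, warped, valid, alpha=0.5):
    """Mix warped source features into the target features where they are valid.

    alpha is a hypothetical blend weight; invalid pixels keep the target features.
    """
    w = alpha * valid[..., None].astype(tgt_feat.dtype)
    return w * warped + (1.0 - w) * tgt_feat
```

With an identity pose the warp returns the source features unchanged, which is a convenient sanity check. In a real pipeline the same lookup would be applied at the resolution of each attention layer, and the depth would come from the rendered Gaussian splatting scene.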

Paper Structure

This paper contains 14 sections, 7 equations, 17 figures, 4 tables, 1 algorithm.

Figures (17)

  • Figure 1: Overview of our method. A single source image is edited using a 2D diffusion model that is conditioned on some prompt. The attention feature maps employed during this process are saved. Given a new reference view, the maps are warped to this view based on the 3D depth map of the reference view. A diffusion model is then applied to the reference view using a blending of the attention feature maps obtained during the diffusion process itself and those that arise from the source view.
  • Figure 2: A comparison of scene editing methods across various scenes is presented. Each sample shows two views, with the modified source image shown as an inset. Additional examples are provided in Figs. \ref{fig:main_vis_bear}, \ref{fig:main_vis_dino}, \ref{fig:main_vis_face}, \ref{fig:main_vis_person} and \ref{fig:main_vis_table}.
  • Figure 3: Obtaining variability. Our method (ControlNet variant) is run with different random seeds to produce diverse stylistic variations. Each row illustrates how varying the random seed impacts the visual output, resulting in unique edits while preserving the overall content structure. The left column shows the warped feature map of the source view, giving insight into how the attention is distributed across the edited images.
  • Figure IV: Comparison of style edits obtained with different source-image editing approaches and different random seeds. Each method is evaluated with three seeds. For each part, the top row displays the edited source image, while the two rows below show novel views generated from the edited model. GC = GaussCtrl, CN = ControlNet.
  • Figure V: User-generated edits comparison between our method and the DGE method. The first row shows the user-provided edited image followed by three novel views generated using our method. The second row displays the same using the DGE method. This comparison highlights the differences in edit quality and consistency between the two approaches.
  • ...and 12 more figures