DGE: Direct Gaussian 3D Editing by Consistent Multi-view Editing

Minghao Chen, Iro Laina, Andrea Vedaldi

TL;DR

This work tackles language-guided editing of 3D scenes by addressing the inefficiencies of relying on 2D editors that yield view-inconsistent edits. It introduces Direct Gaussian Editor (DGE), which uses a fast 3D Gaussian Splatting representation and a two-stage process: first achieving multi-view consistency by extending a diffusion-based editor with spatio-temporal attention and epipolar constraints, then directly fitting the edited views to the 3D model via a rendering-based objective written as $\min_{\mathcal{G}} \sum_{t=1}^T \| I'_t - \operatorname{Rend}(\mathcal{G}, \pi_t) \|$ with LPIPS guidance. The approach enables selective editing since Gaussians can be masked and edited locally, and it yields substantial speedups (e.g., around 4 minutes per edit) with higher fidelity compared to prior methods like IN2N, GaussianEditor, and ViCa-NeRF. Experiments on real and synthetic datasets show improved CLIP-based alignment and texture detail, validating multi-view consistency and epipolar-guided feature propagation as key factors for effective 3D editing with diffusion priors.
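The fitting objective above can be illustrated with a toy example. The sketch below is not the DGE implementation: it replaces Gaussian Splatting with a hypothetical linear "renderer" (one weight matrix per camera), uses a squared L2 penalty in place of the unspecified norm, and omits the LPIPS term. The names `fit_colors`, `total_loss`, `weights`, and `mask` are illustrative assumptions. It shows only the two ideas stated in the summary: fit the model directly to all edited views at once, and restrict updates to a masked subset of Gaussians.

```python
import numpy as np

def total_loss(colors, weights, targets):
    """Sum over views t of ||I'_t - Rend(G, pi_t)||^2 for the toy renderer."""
    return sum(float(np.sum((W @ colors - t) ** 2))
               for W, t in zip(weights, targets))

def fit_colors(colors, weights, targets, mask, steps=500):
    """Masked gradient descent on the multi-view fitting loss.

    colors:  (G, 3) per-Gaussian colors, a stand-in for the parameters of G
    weights: (T, P, G) hypothetical linear renderer, one matrix per camera pi_t
    targets: (T, P, 3) edited views I'_t
    mask:    (G,) bool, True for Gaussians that are allowed to change
    """
    colors = colors.copy()
    # Step size 1/L, where L bounds the largest eigenvalue of the Hessian
    # 2 * sum_t W_t^T W_t, so the loss never increases.
    lipschitz = 2.0 * sum(np.linalg.norm(W, 2) ** 2 for W in weights)
    lr = 1.0 / lipschitz
    for _ in range(steps):
        grad = np.zeros_like(colors)
        for W, target in zip(weights, targets):
            residual = W @ colors - target   # (P, 3)
            grad += 2.0 * W.T @ residual     # (G, 3)
        grad[~mask] = 0.0                    # selective editing: freeze the rest
        colors -= lr * grad
    return colors

rng = np.random.default_rng(0)
T, P, G = 3, 8, 4
weights = rng.random((T, P, G))
weights /= weights.sum(axis=2, keepdims=True)   # blending weights per pixel
original = rng.random((G, 3))
mask = np.array([True, True, False, False])
edited = original.copy()
edited[mask] = rng.random((2, 3))               # the "edit" touches two Gaussians
targets = weights @ edited                      # consistent edited views

fitted = fit_colors(original, weights, targets, mask)
print(total_loss(original, weights, targets), total_loss(fitted, weights, targets))
```

Because the target views in this toy setup are rendered from a single edited model, they are consistent by construction and the loss can be driven toward zero. Inconsistent targets would leave a residual that no single model can remove, which is the failure mode the multi-view consistent editing stage is meant to avoid.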

Abstract

We consider the problem of editing 3D objects and scenes based on open-ended language instructions. A common approach to this problem is to use a 2D image generator or editor to guide the 3D editing process, obviating the need for 3D data. However, this process is often inefficient due to the need for iterative updates of costly 3D representations, such as neural radiance fields, either through individual view edits or score distillation sampling. A major disadvantage of this approach is the slow convergence caused by aggregating inconsistent information across views, as the guidance from 2D models is not multi-view consistent. We thus introduce the Direct Gaussian Editor (DGE), a method that addresses these issues in two stages. First, we modify a given high-quality image editor like InstructPix2Pix to be multi-view consistent. To do so, we propose a training-free approach that integrates cues from the 3D geometry of the underlying scene. Second, given a multi-view consistent edited sequence of images, we directly and efficiently optimize the 3D representation, which is based on 3D Gaussian Splatting. Because it avoids incremental and iterative edits, DGE is significantly more accurate and efficient than existing approaches and offers additional benefits, such as enabling selective editing of parts of the scene.
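One common training-free way to make an image editor process several views jointly is to extend its self-attention so that every view attends to the tokens of all views. The sketch below illustrates that general idea only; it is an assumption about the mechanism, not code from the paper, and the names `cross_view_attention` and `softmax` are hypothetical helpers.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_attention(q, k, v):
    """Attention in which each view attends to the tokens of all views.

    q, k, v: (V, N, C) arrays for V views with N tokens of dimension C.
    Returns an array of shape (V, N, C).
    """
    V, N, C = k.shape
    k_all = k.reshape(V * N, C)            # keys pooled across views
    v_all = v.reshape(V * N, C)            # values pooled across views
    scores = q @ k_all.T / np.sqrt(C)      # (V, N, V*N)
    return softmax(scores, axis=-1) @ v_all

rng = np.random.default_rng(0)
q = rng.standard_normal((3, 5, 8))
k = rng.standard_normal((3, 5, 8))
v = rng.standard_normal((3, 5, 8))
out = cross_view_attention(q, k, v)
print(out.shape)
```

Since all views draw on the same pooled keys and values, two views that issue the same query receive the same output, which is the property that encourages consistent edits across the jointly processed views.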

Paper Structure (42 sections, 6 equations, 12 figures, 3 tables)

Figures (12)

  • Figure 1: Overview. As shown on the left, our method is divided into two main parts: multi-view consistent editing with epipolar constraints and direct 3D fitting. In the multi-view editing stage, key views are randomly selected and jointly fed to the editing diffusion network to extract features with spatio-temporal attention. To edit the other frames, the features of the key views are injected into the diffusion network through correspondence matching on feature maps with epipolar constraints. The detailed feature injection process is shown on the right; only features with a red border (i.e., the points satisfying the epipolar constraints) are considered for correspondence matching.
  • Figure 2: Comparison with other methods. Our method provides fast and detailed edits, such as the textures on the Venetian mask and the mosaic sculpture. Other methods, such as InstructN2N and IP2P+SD, fail to reproduce the mosaic effect because they average over inconsistent edits.
  • Figure 3: Comparison of results with and without multi-view consistency. With the proposed multi-view consistent editing, the edited 3D GS is clear and clean; without it, the optimization either fails to converge or produces blurry results.
  • Figure 4: Comparison of edited 2D images with and without epipolar constraints. With epipolar constraints, the correspondences are matched correctly; without them, matching fails, resulting in inconsistent multi-view edits.
  • Figure 5: Comparison between our DGE and GaussianEditor (Chen et al., 2023) in terms of the number of iterations. Our method achieves realistic editing results with far fewer iterations. With more iterations, our method also gradually refines the details.
  • ...and 7 more figures
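
The epipolar-constrained correspondence matching described in the captions of Figures 1 and 4 can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes a known fundamental matrix `F` with the convention x_key^T F x_src = 0, matches sparse points by cosine similarity rather than dense diffusion feature maps, and the function name `epipolar_match` and the threshold `max_dist` are hypothetical.

```python
import numpy as np

def epipolar_match(feat_src, pts_src, feat_key, pts_key, F, max_dist=2.0):
    """Match each source point to its most similar key-view point, considering
    only key points within max_dist pixels of the source point's epipolar line.

    feat_src: (N, C) features, pts_src: (N, 2) pixel coordinates (x, y)
    feat_key: (M, C) features, pts_key: (M, 2) pixel coordinates (x, y)
    F:        (3, 3) fundamental matrix, assumed x_key^T F x_src = 0
    Returns (N,) indices into the key points, or -1 if no candidate survives.
    """
    xs = np.hstack([pts_src, np.ones((len(pts_src), 1))])   # homogeneous (N, 3)
    xk = np.hstack([pts_key, np.ones((len(pts_key), 1))])   # homogeneous (M, 3)
    lines = xs @ F.T                      # (N, 3): epipolar line F x_src per point
    # Point-to-line distance |a x + b y + c| / sqrt(a^2 + b^2).
    num = np.abs(lines @ xk.T)            # (N, M)
    den = np.linalg.norm(lines[:, :2], axis=1, keepdims=True) + 1e-12
    dist = num / den
    # Cosine similarity between source and key features.
    fs = feat_src / (np.linalg.norm(feat_src, axis=1, keepdims=True) + 1e-12)
    fk = feat_key / (np.linalg.norm(feat_key, axis=1, keepdims=True) + 1e-12)
    sim = fs @ fk.T                       # (N, M)
    sim = np.where(dist <= max_dist, sim, -np.inf)   # discard off-line points
    best = sim.argmax(axis=1)
    best[~np.isfinite(sim.max(axis=1))] = -1
    return best

# Rectified pair: epipolar lines are horizontal, so matches must share a row.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
pts_src = np.array([[5.0, 3.0]])
feat_src = np.array([[1.0, 0.0]])
pts_key = np.array([[9.0, 3.0], [5.0, 7.0]])    # second point is off the line
feat_key = np.array([[0.9, 0.1], [1.0, 0.0]])   # but has the more similar feature

print(epipolar_match(feat_src, pts_src, feat_key, pts_key, F))
print(epipolar_match(feat_src, pts_src, feat_key, pts_key, F, max_dist=np.inf))
```

In this toy case the second key point has the more similar feature but lies off the epipolar line. The constrained call therefore selects the first key point, while disabling the constraint with `max_dist=np.inf` selects the geometrically impossible second one, which is the kind of mismatch the Figure 4 caption attributes to editing without epipolar constraints.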