DGE: Direct Gaussian 3D Editing by Consistent Multi-view Editing
Minghao Chen, Iro Laina, Andrea Vedaldi
TL;DR
This work tackles language-guided editing of 3D scenes, where relying on 2D editors is inefficient because their edits are inconsistent across views. It introduces the Direct Gaussian Editor (DGE), which builds on the fast 3D Gaussian Splatting representation and proceeds in two stages. First, it obtains multi-view consistent edits by extending a diffusion-based image editor with spatio-temporal attention and epipolar constraints. Second, it fits the 3D model $\mathcal{G}$ directly to the edited views $I'_t$ through a rendering-based objective, $\min_{\mathcal{G}} \sum_{t=1}^T \| I'_t - \operatorname{Rend}(\mathcal{G}, \pi_t) \|$, with LPIPS guidance. The approach also supports selective editing, since Gaussians can be masked and edited locally, and it is substantially faster (around 4 minutes per edit) while achieving higher fidelity than prior methods such as IN2N, GaussianEditor, and ViCa-NeRF. Experiments on real and synthetic datasets show improved CLIP-based alignment and texture detail, supporting multi-view consistency and epipolar-guided feature propagation as key factors for effective 3D editing with diffusion priors.
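To make the direct-fitting objective concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a hypothetical linear `render` function in which each pixel is a fixed weighted sum of scene parameters, and it replaces the norm plus LPIPS term with a plain squared L2 loss minimized by gradient descent. The names `render`, `fit_scene`, and `total_loss` are illustrative only; an actual Gaussian Splatting renderer is far more involved.

```python
# Toy illustration of fitting a scene directly to a fixed set of edited views.
# Assumptions (not from the paper): the scene is a flat list of intensities,
# rendering is linear, and the loss is squared L2 without any LPIPS term.

def render(scene, view_weights):
    """Hypothetical renderer: each pixel is a weighted sum of scene values."""
    return [sum(w * s for w, s in zip(row, scene)) for row in view_weights]


def total_loss(scene, views, targets):
    """Sum over all views of the squared pixel residuals."""
    loss = 0.0
    for weights, target in zip(views, targets):
        for pred, tgt in zip(render(scene, weights), target):
            loss += (pred - tgt) ** 2
    return loss


def fit_scene(scene, views, targets, lr=0.1, steps=500):
    """Gradient descent on total_loss with respect to the scene values."""
    scene = list(scene)
    for _ in range(steps):
        grad = [0.0] * len(scene)
        for weights, target in zip(views, targets):
            pred = render(scene, weights)
            for row, p, t in zip(weights, pred, target):
                residual = p - t
                for j, w in enumerate(row):
                    grad[j] += 2.0 * residual * w  # d(residual^2)/d(scene[j])
        scene = [s - lr * g for s, g in zip(scene, grad)]
    return scene


if __name__ == "__main__":
    # Two views with two pixels each, observing a three-value scene.
    views = [
        [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
        [[0.0, 0.0, 1.0], [0.5, 0.5, 0.0]],
    ]
    edited_scene = [0.2, 0.8, 0.5]  # stands in for the consistently edited content
    targets = [render(edited_scene, w) for w in views]
    fitted = fit_scene([0.0, 0.0, 0.0], views, targets)
    print(fitted, total_loss(fitted, views, targets))
```

Because the target views in this toy are mutually consistent (they all come from one underlying scene), a single optimization against a fixed set of targets suffices, which mirrors the motivation for editing all views consistently before fitting.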
Abstract
We consider the problem of editing 3D objects and scenes based on open-ended language instructions. A common approach to this problem is to use a 2D image generator or editor to guide the 3D editing process, obviating the need for 3D data. However, this process is often inefficient due to the need for iterative updates of costly 3D representations, such as neural radiance fields, either through individual view edits or score distillation sampling. A major disadvantage of this approach is the slow convergence caused by aggregating inconsistent information across views, as the guidance from 2D models is not multi-view consistent. We thus introduce the Direct Gaussian Editor (DGE), a method that addresses these issues in two stages. First, we modify a given high-quality image editor like InstructPix2Pix to be multi-view consistent. To do so, we propose a training-free approach that integrates cues from the 3D geometry of the underlying scene. Second, given a multi-view consistent edited sequence of images, we directly and efficiently optimize the 3D representation, which is based on 3D Gaussian Splatting. Because it avoids incremental and iterative edits, DGE is significantly more accurate and efficient than existing approaches and offers additional benefits, such as enabling selective editing of parts of the scene.
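One way 3D geometry can constrain cross-view consistency is through epipolar lines: a pixel in one view can only correspond to pixels near a specific line in another view. The sketch below is a hedged illustration of that idea, not the method as implemented in DGE. It assumes a known fundamental matrix `F`, a hypothetical list of candidate `(pixel, feature)` pairs in the second view, a distance threshold `max_dist`, and cosine similarity as the feature score; all function names are illustrative.

```python
import math

# Sketch of epipolar-constrained feature matching between two views.
# Assumptions (not from the source): F is a 3x3 fundamental matrix given as
# nested lists, features are plain lists of floats, and candidates farther
# than max_dist pixels from the epipolar line are discarded.


def epipolar_line(F, point):
    """Return line coefficients (a, b, c) = F @ [u, v, 1] in the second view."""
    u, v = point
    x = (u, v, 1.0)
    return tuple(sum(F[i][j] * x[j] for j in range(3)) for i in range(3))


def distance_to_line(line, point):
    """Perpendicular distance from a pixel to the line a*u + b*v + c = 0."""
    a, b, c = line
    norm = math.hypot(a, b)
    if norm == 0.0:
        return math.inf  # degenerate line: treat every pixel as too far away
    u, v = point
    return abs(a * u + b * v + c) / norm


def cosine_similarity(f, g):
    dot = sum(x * y for x, y in zip(f, g))
    nf = math.sqrt(sum(x * x for x in f))
    ng = math.sqrt(sum(y * y for y in g))
    return dot / (nf * ng) if nf > 0 and ng > 0 else 0.0


def match_along_epipolar_line(F, query_point, query_feature, candidates, max_dist=1.0):
    """Pick the most feature-similar candidate lying near the epipolar line."""
    line = epipolar_line(F, query_point)
    best_point, best_score = None, -math.inf
    for point, feature in candidates:
        if distance_to_line(line, point) > max_dist:
            continue  # geometrically implausible correspondence
        score = cosine_similarity(query_feature, feature)
        if score > best_score:
            best_point, best_score = point, score
    return best_point


if __name__ == "__main__":
    # Fundamental matrix of a pure horizontal camera shift: matches share a row.
    F = [[0, 0, 0], [0, 0, -1], [0, 1, 0]]
    candidates = [
        ((3, 5), [0.9, 0.1]),   # on the line, similar feature
        ((7, 5), [0.0, 1.0]),   # on the line, dissimilar feature
        ((10, 20), [1.0, 0.0]), # identical feature, but far from the line
    ]
    print(match_along_epipolar_line(F, (10, 5), [1.0, 0.0], candidates))
```

The geometric filter rejects the third candidate even though its feature matches exactly, which illustrates why a geometry-aware constraint can suppress spurious correspondences that appearance alone would accept.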
