3D-Consistent Multi-View Editing by Diffusion Guidance

Josef Bengtson, David Nilsson, Dong In Lee, Fredrik Kahl

TL;DR

This work addresses the geometrically and photometrically inconsistent edits that arise when image-editing methods are applied independently to multiple views of a 3D scene. It introduces a training-free diffusion-guidance framework that enforces cross-view consistency through a consistency loss computed at points matched across the unedited views, guiding diffusion sampling toward coherent edits. The method supports both dense and sparse multi-view editing, and the edited images can be used to update 3D Gaussian Splat models, yielding sharp, faithful edits with better multi-view consistency than existing baselines. Experiments show improved consistency, competitive or superior text-alignment fidelity, and effective sparse-view editing. Because it builds on 2D diffusion editors, the approach offers a scalable path to 3D content editing.

Abstract

Recent advancements in diffusion models have greatly improved text-based image editing, yet methods that edit images independently often produce geometrically and photometrically inconsistent results across different views of the same scene. Such inconsistencies are particularly problematic for editing of 3D representations such as NeRFs or Gaussian Splat models. We propose a training-free diffusion framework that enforces multi-view consistency during the image editing process. The key assumption is that corresponding points in the unedited images should undergo similar transformations after editing. To achieve this, we introduce a consistency loss that guides the diffusion sampling toward coherent edits. The framework is flexible and can be combined with widely varying image editing methods, supporting both dense and sparse multi-view editing setups. Experimental results show that our approach significantly improves 3D consistency compared to existing multi-view editing methods. We also show that this increased consistency enables high-quality Gaussian Splat editing with sharp details and strong fidelity to user-specified text prompts. Please refer to our project page for video results: https://3d-consistent-editing.github.io/
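
The abstract does not give the consistency loss in closed form. As a minimal sketch of the key assumption (corresponding points should be edited similarly), the snippet below penalizes the squared difference between two edited views at matched points and computes the gradient with respect to the view currently being edited. The grayscale list-of-lists image type, the function names `consistency_loss` and `consistency_grad`, and the squared-difference penalty are all illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple

# Illustrative types (assumptions): a grayscale image as rows of floats, and a
# match as ((row, col) in the current view, (row, col) in the reference view).
Image = List[List[float]]
Match = Tuple[Tuple[int, int], Tuple[int, int]]


def consistency_loss(current: Image, reference: Image, matches: List[Match]) -> float:
    """Mean squared difference between two edited views at matched points."""
    if not matches:
        return 0.0
    total = 0.0
    for (r1, c1), (r2, c2) in matches:
        diff = current[r1][c1] - reference[r2][c2]
        total += diff * diff
    return total / len(matches)


def consistency_grad(current: Image, reference: Image, matches: List[Match]) -> Image:
    """Gradient of consistency_loss with respect to the pixels of `current`."""
    grad = [[0.0] * len(row) for row in current]
    for (r1, c1), (r2, c2) in matches:
        grad[r1][c1] += 2.0 * (current[r1][c1] - reference[r2][c2]) / len(matches)
    return grad


# Toy example: a previously edited view and the view being edited, with two
# matches assumed to come from the unedited images.
reference = [[0.2, 0.8], [0.5, 0.1]]
current = [[0.9, 0.3], [0.4, 0.6]]
matches = [((0, 0), (0, 1)), ((1, 1), (1, 0))]

before = consistency_loss(current, reference, matches)
grad = consistency_grad(current, reference, matches)

# One small gradient step on the current view; only matched pixels move.
step = 0.25
updated = [[p - step * g for p, g in zip(prow, grow)]
           for prow, grow in zip(current, grad)]
after = consistency_loss(updated, reference, matches)
print(before, after)
```

In this sketch the gradient is nonzero only at matched pixels, so unmatched regions are left to the underlying image editor. A real pipeline would operate on diffusion latents or decoded images with many matches per view pair, which this toy example does not attempt.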

Paper Structure

This paper contains 21 sections, 2 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Image editing methods applied independently to multi-view images often produce inconsistent edits across views, as shown here where corresponding regions differ between edits. Our method improves multi-view consistency by guiding the diffusion process under the assumption that matching points in the unedited images, as shown by the red lines, should be edited similarly.
  • Figure 2: Overview of our method. Given a set of input images, each view is edited sequentially by guiding the diffusion process based on the previously edited images. The guidance is based on the assumption that matching points in the unedited images should be edited similarly. During the diffusion process the noise estimate $\epsilon(z_t,t)$ is modified according to a consistency loss $\mathcal{L}_c$ resulting in multi-view consistent edits. In turn, these edited images can be used to update a Gaussian splat model.
  • Figure 3: Qualitative example of multi-view consistent image editing, where all methods build on the image editing method IP2P. EditSplat and DGE alter the images more than our method, whose results stay closer to the unedited images, as can be seen, e.g., in the shape of the face or the texture of the grass next to the bench. Our method also produces more consistent edits across views, as can be seen in the face and arms of the person and in the bicycle wheels.
  • Figure 4: Renderings from edited 3D Gaussian splat models. For the face, the per-image edits sometimes cause blurriness, as seen, e.g., in the ears, which are sharper for ours. EditSplat uses a segmentation mask, so the edit is localized to the face and hair of the person. For DGE, the fidelity to the text prompt is high, but the appearance of the face is drastically altered. For the garden, our method gives a clearer edit than the per-image edits and EditSplat, as seen in the more visible fog and the better-preserved details of the object on the table.
  • Figure 5: Our multi-view consistent editing using the image editing method pix2pix-turbo. The per-image edits can be inconsistent, and detail is lost when a 3D Gaussian splat model is edited using these inconsistent images. In contrast, our edits are more consistent, and the details are more accurately recovered in the Gaussian splat renderings.
  • ...and 7 more figures
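
The Figure 2 caption says the noise estimate $\epsilon(z_t,t)$ is modified according to $\mathcal{L}_c$, but this excerpt does not show how. One common form of loss-based guidance, borrowed from classifier guidance and given here only as an assumption about the general shape of such an update, shifts the noise estimate along the gradient of the loss. The guidance scale $s$ and the cumulative noise schedule $\bar{\alpha}_t$ are symbols introduced for illustration and may differ from the paper's notation.

$$
\hat{\epsilon}(z_t, t) = \epsilon(z_t, t) + s \, \sqrt{1 - \bar{\alpha}_t} \; \nabla_{z_t} \mathcal{L}_c
$$

Under this form, a larger $s$ pulls the denoised prediction more strongly toward low consistency loss, and $s = 0$ recovers the unguided editor.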