Advancing 3D Gaussian Splatting Editing with Complementary and Consensus Information

Xuanqi Zhang, Jieun Lee, Chris Joslin, Wonsook Lee

TL;DR

Text-guided 3D Gaussian Splatting (3DGS) editing suffers from two problems: geometry that is inconsistent across views, and depth maps that encode textures rather than geometry. The authors introduce a Complementary Information Mutual Learning Network (CIMLN) for depth refinement and Wavelet Consensus Attention (WCA) for latent-code alignment in diffusion-based editing, both integrated into the 3DGS pipeline. The scene is represented as a set of $N$ Gaussians $\mathcal{G} = \{(\sigma_i, \mu_i, \Sigma_i, c_i)\}_{i=1}^{N}$, and per-view depth is computed by alpha compositing, $\hat{D} = \sum_{i \in \mathbf{N}} d_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)$. The method reports better PSNR, RMSE, LPIPS, and CLIPdir scores than prior approaches on multiple datasets, with efficient editing times thanks to 3DGS, which makes it practical for text-guided 3D scene editing.
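The depth formula in the TL;DR can be illustrated for a single ray. The following is a minimal sketch, not the authors' implementation: it assumes the Gaussians hitting the ray are already sorted front to back, and that each one comes with a depth $d_i$ and an effective opacity $\alpha_i$ in $[0, 1]$. The function name `composite_depth` is hypothetical.

```python
def composite_depth(depths, alphas):
    """Alpha-composite per-Gaussian depths along one ray.

    Assumes both sequences are ordered front to back (an assumption of
    this sketch) and that every alpha lies in [0, 1].
    """
    if len(depths) != len(alphas):
        raise ValueError("depths and alphas must have the same length")

    depth = 0.0
    transmittance = 1.0  # prod_{j<i} (1 - alpha_j); equals 1 before the first Gaussian
    for d, a in zip(depths, alphas):
        depth += d * a * transmittance   # term d_i * alpha_i * prod_{j<i} (1 - alpha_j)
        transmittance *= (1.0 - a)       # light remaining after passing Gaussian i
    return depth


# Three Gaussians: weights are 0.5, 0.25, 0.25, so the depth is
# 1.0 * 0.5 + 2.0 * 0.25 + 3.0 * 0.25 = 1.75.
print(composite_depth([1.0, 2.0, 3.0], [0.5, 0.5, 1.0]))
```

Because the weights depend on opacity rather than on geometry alone, a fully opaque first Gaussian hides everything behind it, and semi-transparent Gaussians blend their depths together. That blending is one plausible reason a rendered depth map can pick up texture-like detail, which is the problem the TL;DR attributes to raw 3DGS depth.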

Abstract

We present a novel framework for enhancing the visual fidelity and consistency of text-guided 3D Gaussian Splatting (3DGS) editing. Existing editing approaches face two critical challenges: inconsistent geometric reconstructions across multiple viewpoints, particularly in challenging camera positions, and ineffective utilization of depth information during image manipulation, resulting in over-texture artifacts and degraded object boundaries. To address these limitations, we introduce: 1) A complementary information mutual learning network that enhances depth map estimation from 3DGS, enabling precise depth-conditioned 3D editing while preserving geometric structures. 2) A wavelet consensus attention mechanism that effectively aligns latent codes during the diffusion denoising process, ensuring multi-view consistency in the edited results. Through extensive experimentation, our method demonstrates superior performance in rendering quality and view consistency compared to state-of-the-art approaches. The results validate our framework as an effective solution for text-guided editing of 3D scenes.

Paper Structure

This paper contains 12 sections, 10 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of the proposed 3D Gaussian splatting editing system. (a) Multi-view images $\mathcal{M}$ are used to train a 3DGS model, from which depth and RGB images can be rendered at a given viewpoint. (b) Rendered depth maps and RGB images are processed through the Source and Guide branches, respectively. Pixel mutual learning and downsampling enable self-supervised training. (c) The framework replaces ControlNet's image self-attention with WCA to better align images with reference views.
  • Figure 2: Visual comparison of text-driven 3D editing on different scenes. Inconsistency and noise are reduced in the fine details.
  • Figure 3: Visual comparison for the prompt 'Turn it into a polar bear'. Our method maintains multi-view consistency across rendered images.
  • Figure 4: Detailed design of Pixel Mutual Learning.
  • Figure 5: Ablation study on the effect of the proposed CIMLN module.
  • ...and 1 more figure