Advancing 3D Gaussian Splatting Editing with Complementary and Consensus Information
Xuanqi Zhang, Jieun Lee, Chris Joslin, Wonsook Lee
TL;DR
Text-guided 3D Gaussian Splatting (3DGS) editing suffers from two problems: geometry that is inconsistent across views, and depth maps that encode texture rather than geometry. The authors introduce a Complementary Information Mutual Learning Network (CIMLN) for depth refinement and Wavelet Consensus Attention (WCA) for latent-code alignment in diffusion-based editing, both integrated into the 3DGS pipeline. The 3DGS representation uses a Gaussian mixture ${\mathcal{G}} = \{(\sigma_i, \mu_i, \Sigma_i, c_i)\}_{i=1}^{N}$, and per-view depth is computed via ${\hat{D}} = \sum_{i \in \mathbf{N}} d_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)$. The method reports better PSNR, RMSE, LPIPS and CLIPdir scores than state-of-the-art approaches on multiple datasets, and editing remains efficient thanks to 3DGS.
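To make the per-view depth formula concrete, the following minimal sketch evaluates ${\hat{D}}$ for a single pixel ray. It is an illustration under stated assumptions, not the authors' implementation: it assumes the Gaussians hitting the ray are already sorted front to back, that $d_i$ is the depth of the $i$-th Gaussian along the ray, and that $\alpha_i$ is its effective opacity at that pixel. The function name `composite_depth` is hypothetical.

```python
def composite_depth(depths, alphas):
    """Alpha-composite per-Gaussian depths along one ray.

    Assumes both sequences are ordered front to back (an assumption of
    this sketch). Computes sum_i d_i * alpha_i * prod_{j<i} (1 - alpha_j).
    """
    if len(depths) != len(alphas):
        raise ValueError("depths and alphas must have the same length")

    transmittance = 1.0  # running product of (1 - alpha_j) for j < i
    depth = 0.0
    for d, a in zip(depths, alphas):
        depth += d * a * transmittance
        transmittance *= 1.0 - a
    return depth


# Three Gaussians along one ray; the last one is fully opaque.
print(composite_depth([1.0, 2.0, 3.0], [0.5, 0.5, 1.0]))  # 1.75
```

In the example, the three contributions are 1.0 · 0.5, 2.0 · 0.5 · 0.5 and 3.0 · 1.0 · 0.25, which sum to 1.75. Because each weight is scaled by the accumulated transmittance, Gaussians behind opaque ones contribute little or nothing to the rendered depth.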
Abstract
We present a novel framework for enhancing the visual fidelity and consistency of text-guided 3D Gaussian Splatting (3DGS) editing. Existing editing approaches face two critical challenges: inconsistent geometric reconstructions across multiple viewpoints, particularly in challenging camera positions, and ineffective utilization of depth information during image manipulation, resulting in over-texture artifacts and degraded object boundaries. To address these limitations, we introduce: 1) A complementary information mutual learning network that enhances depth map estimation from 3DGS, enabling precise depth-conditioned 3D editing while preserving geometric structures. 2) A wavelet consensus attention mechanism that effectively aligns latent codes during the diffusion denoising process, ensuring multi-view consistency in the edited results. Through extensive experimentation, our method demonstrates superior performance in rendering quality and view consistency compared to state-of-the-art approaches. The results validate our framework as an effective solution for text-guided editing of 3D scenes.
