EditSplat: Multi-View Fusion and Attention-Guided Optimization for View-Consistent 3D Scene Editing with 3D Gaussian Splatting

Dong In Lee, Hyeongcheol Park, Jiyoung Seo, Eunbyung Park, Hyunje Park, Ha Dam Baek, Sangheon Shin, Sangmin Kim, Sangpil Kim

TL;DR

EditSplat tackles the dual challenges of multi-view inconsistency and optimization inefficiency in text-driven 3D scene editing with 3D Gaussian Splatting. It introduces Multi-view Fusion Guidance (MFG) to enforce cross-view coherence by integrating multi-view details into diffusion-based edits, and Attention-Guided Trimming (AGT) to prune and selectively optimize Gaussians based on attention maps for semantic local editing. The approach yields state-of-the-art qualitative and quantitative performance across diverse datasets, demonstrating robust view-consistent edits and improved optimization efficiency. This framework enables practical, high-fidelity 3D edits guided solely by text prompts, with potential impact on AR/VR content creation and real-time editing workflows.

Abstract

Recent advancements in 3D editing have highlighted the potential of text-driven methods in real-time, user-friendly AR/VR applications. However, current methods rely on 2D diffusion models without adequately considering multi-view information, resulting in multi-view inconsistency. While 3D Gaussian Splatting (3DGS) significantly improves rendering quality and speed, its 3D editing process encounters difficulties with inefficient optimization, as pre-trained Gaussians retain excessive source information, hindering optimization. To address these limitations, we propose EditSplat, a novel text-driven 3D scene editing framework that integrates Multi-view Fusion Guidance (MFG) and Attention-Guided Trimming (AGT). Our MFG ensures multi-view consistency by incorporating essential multi-view information into the diffusion process, leveraging classifier-free guidance from the text-to-image diffusion model and the geometric structure inherent to 3DGS. Additionally, our AGT utilizes the explicit representation of 3DGS to selectively prune and optimize 3D Gaussians, enhancing optimization efficiency and enabling precise, semantically rich local editing. Through extensive qualitative and quantitative evaluations, EditSplat achieves state-of-the-art performance, establishing a new benchmark for text-driven 3D scene editing.
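The abstract says MFG builds on classifier-free guidance but does not give the formula here. As a rough illustration only, the sketch below chains guidance terms in the style of instruction-based image editing models, where each extra condition contributes a scaled difference between two noise predictions. The function name `compose_guidance`, the condition ordering (source image, then fused multi-view image, then text), and the scale values are assumptions made for this example, not the paper's formulation.

```python
from typing import List, Sequence

Vector = List[float]


def compose_guidance(
    uncond: Vector,
    conditioned: Sequence[Vector],
    scales: Sequence[float],
) -> Vector:
    """Chain classifier-free guidance terms (illustrative sketch).

    conditioned[i] is the noise prediction with conditions 0..i switched on;
    scales[i] weights the change caused by switching on condition i.
    """
    if len(conditioned) != len(scales):
        raise ValueError("need exactly one scale per conditioned prediction")

    result = list(uncond)
    previous = uncond
    for pred, scale in zip(conditioned, scales):
        if len(pred) != len(uncond):
            raise ValueError("all predictions must have the same length")
        # Add the scaled difference between this prediction and the previous one.
        result = [r + scale * (p - q) for r, p, q in zip(result, pred, previous)]
        previous = pred
    return result


# Toy two-element "noise predictions" (hypothetical values).
eps_uncond = [0.0, 0.0]
eps_source = [1.0, 0.0]      # source image condition only
eps_multiview = [1.0, 2.0]   # plus a fused multi-view condition (assumed)
eps_text = [2.0, 2.0]        # plus the text prompt

guided = compose_guidance(
    eps_uncond,
    [eps_source, eps_multiview, eps_text],
    scales=[1.5, 1.0, 7.5],  # hypothetical guidance scales
)
```

With every scale set to 1, the sum telescopes to the fully conditioned prediction. Raising a single scale pushes the result further along the direction contributed by that condition.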

Paper Structure

This paper contains 39 sections, 14 equations, 17 figures, 3 tables, 1 algorithm.

Figures (17)

  • Figure 1: Result of EditSplat. EditSplat enables flexible and high-quality editing of pre-trained 3D Gaussian Splatting models guided solely by textual instructions. Through its design focused on multi-view consistency, efficient optimization, and precise semantic local editing, our approach demonstrates robust performance, producing realistic and fine-grained 3D scene modifications.
  • Figure 2: Challenging cases. (a) Conventional 2D editing without Multi-view Fusion Guidance (MFG) results in inconsistent textures across different views (e.g., bear fur), whereas applying MFG achieves consistent multi-view edits. (b) Editing without Attention-Guided Trimming (AGT) leads to inefficient optimization, leaving target regions under-edited (e.g., clown’s nose). AGT improves optimization quality, producing richer colors.
  • Figure 3: EditSplat Overview. EditSplat consists of two main methods: (1) Multi-view Fusion Guidance (MFG, \ref{sec:MFG}), which aligns multi-view information with text prompts and source images to ensure multi-view consistency; (2) Attention-Guided Trimming (AGT, \ref{sec:AGT}), which prunes pre-trained Gaussians for optimization efficiency and selectively optimizes Gaussians for semantic local editing.
  • Figure 4: Qualitative Comparison. EditSplat provides more intense and precise editing than the baselines. The leftmost column shows source images, while the right columns show images rendered from the edited 3DGS. In each corner of the images, we include different views of the corresponding image to compare multi-view consistency. Note that EditSplat outperforms the baselines in both local and global editing.
  • Figure 5: MFG ablation. The top row shows the results of 2D editing with and without MFG. With MFG, the 2D diffusion model produces multi-view-consistent results. In the bottom row, we have images rendered from edited 3DGS based on the above images. Editing 3DGS with view-consistent images results in clear outputs with high fidelity, while other cases produce inconsistent results with low fidelity.
  • ...and 12 more figures
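
The captions for Figures 2 and 3 describe Attention-Guided Trimming as pruning pre-trained Gaussians and selectively optimizing the rest, but they do not state the criteria. A minimal sketch follows, assuming each Gaussian already carries a scalar attention weight in [0, 1]. The `Gaussian` class, the `trim` function, the choice to prune the highest-attention Gaussians first, and both parameters are hypothetical choices for this example rather than the paper's algorithm.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Gaussian:
    gid: int
    attention: float  # hypothetical per-Gaussian attention weight in [0, 1]


def trim(
    gaussians: List[Gaussian],
    prune_fraction: float,
    optimize_threshold: float,
) -> Tuple[List[Gaussian], List[int]]:
    """Return (kept Gaussians, ids of kept Gaussians selected for optimization).

    Assumption: the Gaussians most strongly attended by the edit prompt are
    pruned first, and only kept Gaussians at or above the threshold are trained.
    """
    if not 0.0 <= prune_fraction <= 1.0:
        raise ValueError("prune_fraction must lie in [0, 1]")

    # Rank by attention, highest first, and drop the top fraction.
    ranked = sorted(gaussians, key=lambda g: g.attention, reverse=True)
    n_prune = int(len(ranked) * prune_fraction)
    pruned_ids = {g.gid for g in ranked[:n_prune]}

    kept = [g for g in gaussians if g.gid not in pruned_ids]
    trainable_ids = [g.gid for g in kept if g.attention >= optimize_threshold]
    return kept, trainable_ids


# Toy scene with four Gaussians (hypothetical attention values).
scene = [
    Gaussian(0, 0.9),
    Gaussian(1, 0.1),
    Gaussian(2, 0.6),
    Gaussian(3, 0.05),
]
kept, trainable_ids = trim(scene, prune_fraction=0.25, optimize_threshold=0.5)
```

In this toy scene the single most-attended Gaussian is removed, one remaining Gaussian is marked for optimization, and the two low-attention Gaussians stay frozen, which is how the sketch confines the edit to a local region.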