
SVGS: Single-View to 3D Object Editing via Gaussian Splatting

Pengcheng Xue, Yan Tian, Qiutao Song, Ziyi Wang, Linyang He, Weiping Ding, Mahmoud Hassaballah, Karen Egiazarian, Wei-Fa Yang, Leszek Rutkowski

Abstract

Text-driven 3D scene editing has attracted considerable interest due to its convenience and user-friendliness. However, methods that rely on implicit 3D representations, such as Neural Radiance Fields (NeRF), while effective in rendering complex scenes, are hindered by slow processing speeds and limited control over specific regions of the scene. Moreover, existing approaches, including Instruct-NeRF2NeRF and GaussianEditor, which utilize multi-view editing strategies, frequently produce inconsistent results across different views when executing text instructions. This inconsistency can adversely affect the overall performance of the model, complicating the task of balancing the consistency of editing results with editing efficiency. To address these challenges, we propose a novel method termed Single-View to 3D Object Editing via Gaussian Splatting (SVGS), which is a single-view text-driven editing technique based on 3D Gaussian Splatting (3DGS). Specifically, in response to text instructions, we introduce a single-view editing strategy grounded in multi-view diffusion models, which reconstructs 3D scenes by leveraging only those views that yield consistent editing results. Additionally, we employ sparse 3D Gaussian Splatting as the 3D representation, which significantly enhances editing efficiency. We conducted a comparative analysis of SVGS against existing baseline methods across various scene settings, and the results indicate that SVGS outperforms its counterparts in both editing capability and processing speed, representing a significant advancement in 3D editing technology. For further details, please visit our project page at: https://amateurc.github.io/svgs.github.io.

Paper Structure

This paper contains 19 sections, 15 equations, 8 figures, 3 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Results of 3D Object Editing with SVGS. This study demonstrates that text-driven 3D object editing can be accomplished through the modification of a single image, allowing for the generation of consistent multi-view images based on the edited outcomes. Notably, our algorithm facilitates precise alterations, such as changing a light blue hue to a dark blue hue, without impacting other regions of the object. Furthermore, the edited model can be exported as a high-quality mesh.
  • Figure 2: The Framework of SVGS. 1) Initially, a single-view image is modified using a correlation editing strategy based on IP2P. 2) Following the editing process, a multi-view diffusion model is employed to generate multi-view images consistent with the modifications. 3) For sparse reconstruction using 3DGS, a visual hull is constructed from camera parameters and masked images to initialize the 3D Gaussians, which are subsequently refined using the loss function $\mathcal{L}_{gs}$. 4) Within the depth regularization module, depth maps are rendered for the input views, and the loss is computed against pre-generated monocular depth maps (a minimal sketch of such a depth-regularization loss appears after the figure list). The resulting Gaussian field facilitates efficient, high-quality novel-view synthesis.
  • Figure 3: Visualization of the Relevance-Aware Editing Mechanism. We visualize the intermediate steps of our algorithm. The Noise Difference Heatmap (b), derived from Eq. (\ref{eq:relevance_map}), effectively acts as a semantic attention mechanism. It highlights the target region (e.g., the eyes for "put glasses on") where the text instruction exerts significant influence, while suppressing the background. This is thresholded into a binary mask (c) to ensure the final output strictly preserves the unedited content (a sketch of this thresholding step appears after the figure list).
  • Figure 4: Consistency Experiments. The top row presents six images sourced from the original dataset, whereas the middle row illustrates the outcomes of independently editing each view utilizing the IP2P method. The results of these edits exhibit significant inconsistency and do not correspond effectively with the provided textual instructions. In contrast, our proposed method produces consistent and well-edited multi-view images, as demonstrated in the bottom row.
  • Figure 5: Qualitative comparison of text-driven 3D object editing. The visual results are organized by methods (columns) and editing tasks (rows). The first column displays the original reference image, while the remaining columns present multi-view renderings from Instruct-N2N [haque2023instruct], GaussianEditor [wang2024gaussianeditor], ViCA-NeRF [dong2024vica], and our proposed SVGS. The seven rows correspond to distinct objects and editing instructions: (Row 1) color modification; (Row 2) accessory addition; (Row 3) style transfer; (Row 4) structural deformation (elongating rabbit ears); (Row 5) material transfer (wood to glossy marble); (Row 6) texture pattern addition (gold leaf); and (Row 7) localized part recoloring. Two views are rendered for each method to illustrate the appearance across different viewpoints.
  • ...and 3 more figures
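
The relevance-aware editing step visualized in Figure 3 can be pictured as follows: the diffusion model's noise predictions with and without the text instruction are compared, their per-pixel difference forms the heatmap of Figure 3(b), and thresholding that heatmap yields the binary mask of Figure 3(c). The snippet below is a minimal, hypothetical sketch assuming a diffusers-style instruction-conditioned UNet; the function name and the `thresh` value are assumptions, IP2P's image conditioning is omitted for brevity, and the paper's exact Eq. (\ref{eq:relevance_map}) is not reproduced here.

```python
import torch

@torch.no_grad()
def relevance_mask(unet, noisy_latent, t, text_emb, null_emb, thresh=0.5):
    """Binary edit mask from the disagreement of two noise predictions
    (hypothetical sketch; IP2P's image conditioning is omitted for brevity)."""
    # Noise prediction conditioned on the text instruction.
    eps_text = unet(noisy_latent, t, encoder_hidden_states=text_emb).sample
    # Noise prediction with an empty ("null") instruction.
    eps_null = unet(noisy_latent, t, encoder_hidden_states=null_emb).sample

    # Per-pixel magnitude of the difference -> relevance heatmap (Fig. 3b),
    # normalized to [0, 1].
    heatmap = (eps_text - eps_null).abs().mean(dim=1, keepdim=True)
    heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)

    # Threshold into a binary mask (Fig. 3c); regions outside the mask are
    # later copied from the original view so unedited content is preserved.
    return (heatmap > thresh).float()
```

In practice the resulting mask would be applied in latent or pixel space so that unmasked regions are taken directly from the original view.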
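
The depth-regularization module in Figure 2 (step 4) compares depth maps rendered from the sparse Gaussians with pre-generated monocular depth maps. The sketch below is one plausible realization rather than the paper's exact term: because monocular depth is only defined up to an affine transform, it first solves for a per-image scale and shift in closed form before taking an L1 penalty (the function name and the specific loss form are assumptions).

```python
import torch
import torch.nn.functional as F

def depth_regularization_loss(rendered_depth, mono_depth, mask=None):
    """Align rendered depth with a monocular depth prior (hypothetical sketch).

    Monocular depth is only known up to scale and shift, so we fit an affine
    map a*mono + b ~= rendered in closed form before computing the L1 error.
    """
    if mask is None:
        mask = torch.ones_like(rendered_depth, dtype=torch.bool)
    d = rendered_depth[mask]   # rendered depth at valid pixels
    m = mono_depth[mask]       # monocular depth prior at valid pixels

    # Least-squares fit of scale a and shift b for the prior.
    A = torch.stack([m, torch.ones_like(m)], dim=1)       # (N, 2)
    ab = torch.linalg.lstsq(A, d.unsqueeze(1)).solution   # (2, 1)
    aligned = (A @ ab).squeeze(1)                          # a*m + b

    return F.l1_loss(aligned, d)
```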