GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions

Junjie Wang, Jiemin Fang, Xiaopeng Zhang, Lingxi Xie, Qi Tian

TL;DR

This work tackles the challenge of delicate, localized editing in 3D scenes by representing scenes as explicit 3D Gaussians and editing only a targeted region. It introduces GaussianEditor, a three-stage framework that (i) extracts a text-driven region of interest (RoI), (ii) aligns that RoI to the 3D Gaussian space via segmentation and lifting, and (iii) performs diffusion-based editing constrained to the Gaussian RoI with backpropagation limited to that region. Key contributions include an end-to-end RoI extraction/grounding pathway powered by LLMs, an RoI-alignment mechanism with per-Gaussian RoI attributes, and a delicate editing loop that achieves precise changes while preserving global structure, all with substantially faster training (about 20 minutes on a single V100) than prior methods. Empirically, GaussianEditor yields stronger local edit precision than Instruct-NeRF2NeRF on multi-object scenes, with favorable quantitative metrics and user-study preferences, while highlighting limitations in grounding reliability and diffusion stability. The approach enables robust, region-specific 3D edits and paves the way for deeper integration with 3D generative models and dynamic scene editing.

Abstract

Recently, impressive results have been achieved in 3D scene editing with text instructions based on a 2D diffusion model. However, current diffusion models primarily generate images by predicting noise in the latent space, and the editing is usually applied to the whole image, which makes it challenging to perform delicate, especially localized, editing for 3D scenes. Inspired by recent 3D Gaussian splatting, we propose a systematic framework, named GaussianEditor, to edit 3D scenes delicately via 3D Gaussians with text instructions. Benefiting from the explicit property of 3D Gaussians, we design a series of techniques to achieve delicate editing. Specifically, we first extract the region of interest (RoI) corresponding to the text instruction, aligning it to 3D Gaussians. The Gaussian RoI is further used to control the editing process. Our framework can achieve more delicate and precise editing of 3D scenes than previous methods while enjoying much faster training speed, i.e., within 20 minutes on a single V100 GPU, more than twice as fast as Instruct-NeRF2NeRF (45 minutes to 2 hours).
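
The idea that the Gaussian RoI "controls the editing process" can be illustrated with a toy update rule. The following is a minimal sketch, not the authors' implementation: it assumes each Gaussian carries a small parameter vector (here an RGB color) and a boolean RoI flag, and it uses plain gradient descent in place of the actual splatting and diffusion-based loss. The function name `roi_masked_step` and all values are hypothetical.

```python
from typing import List, Sequence


def roi_masked_step(
    params: Sequence[Sequence[float]],
    grads: Sequence[Sequence[float]],
    roi_mask: Sequence[bool],
    lr: float = 0.5,
) -> List[List[float]]:
    """Apply one gradient-descent step, but only to Gaussians inside the RoI."""
    if not (len(params) == len(grads) == len(roi_mask)):
        raise ValueError("params, grads and roi_mask must have the same length")
    updated = []
    for p, g, in_roi in zip(params, grads, roi_mask):
        if in_roi:
            # Gaussian belongs to the RoI: take the gradient step.
            updated.append([pi - lr * gi for pi, gi in zip(p, g)])
        else:
            # Gaussian is outside the RoI: its gradient is ignored.
            updated.append(list(p))
    return updated


# Toy scene: three Gaussians with RGB colors; only the first is in the RoI.
colors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
grads = [[1.0, 1.0, 1.0]] * 3
roi = [True, False, False]
edited = roi_masked_step(colors, grads, roi)
```

In this toy run only the first Gaussian changes, while the other two keep their original colors even though their gradients are non-zero. This is the property that keeps the edit local.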

Paper Structure

This paper contains 36 sections, 14 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: We propose GaussianEditor, an interactive framework to achieve delicate 3D scene editing following text instructions. As shown in this figure, our method can precisely control the editing region and achieve multi-round editing.
  • Figure 2: Our framework, named GaussianEditor, consists of three key steps. First, a module ${\mathcal{M}}_{Desc}$ produces a description of the input scene, which is passed to an LLM assistant ${\mathcal{M}}_{LLM}$ together with the user's text instruction ${\mathcal{T}}$ to obtain the text RoI ${\mathcal{T}}_{RoI}$. Second, a grounding segmentation module ${\mathcal{M}}_{Seg}$ converts ${\mathcal{T}}_{RoI}$ to an image RoI ${\mathcal{I}}_{RoI}$, which is then lifted to the 3D Gaussian RoI ${\mathcal{G}}_{RoI}$ by RoI lifting ${\mathcal{M}}_{Lift}$, where additional user instructions ${\mathcal{O}}$ can be incorporated. Third, following the user instruction ${\mathcal{T}}$, an image ${\mathcal{I}}_{render}$ rendered from a randomly chosen view is edited by a diffusion model ${\mathcal{M}}_{DM}$, and the loss between ${\mathcal{I}}_{render}$ and its edited version ${\mathcal{I}}_{edit}$ is calculated. Finally, gradient backpropagation and optimization are performed within the Gaussian RoI ${\mathcal{G}}_{RoI}$ to get the edited scene ${\mathcal{G}}_{edit}$.
  • Figure 3: The process of obtaining scene description.
  • Figure 4: Qualitative results on outdoor scenes. Our method supports separate foreground and background editing in real-world scenes.
  • Figure 5: Comparisons with Instruct-NeRF2NeRF (IN2N) [instructnerf2023] on the scene presented in their paper.
  • ...and 8 more figures
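
The first two steps in the Figure 2 caption (scene description, text RoI, image RoI, Gaussian RoI) can be read as a simple data flow. The sketch below only wires that flow together. It is not the paper's code: the `Stages` container, the `gaussian_roi` function, and the toy lambdas are all hypothetical stand-ins for ${\mathcal{M}}_{Desc}$, ${\mathcal{M}}_{LLM}$, ${\mathcal{M}}_{Seg}$, and ${\mathcal{M}}_{Lift}$, and the optional user instructions ${\mathcal{O}}$ are left out.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class Stages:
    """Hypothetical callables standing in for the modules named in Figure 2."""
    describe: Callable[[Sequence[Any]], str]     # M_Desc: views -> scene description
    llm: Callable[[str, str], str]               # M_LLM: (description, T) -> T_RoI
    segment: Callable[[Any, str], Any]           # M_Seg: (view, T_RoI) -> I_RoI
    lift: Callable[[Sequence[Any]], List[bool]]  # M_Lift: I_RoI per view -> G_RoI


def gaussian_roi(views: Sequence[Any], instruction: str, stages: Stages) -> List[bool]:
    """Chain the stages: description -> text RoI -> image RoIs -> Gaussian RoI."""
    description = stages.describe(views)
    text_roi = stages.llm(description, instruction)
    image_rois = [stages.segment(view, text_roi) for view in views]
    return stages.lift(image_rois)


# Toy stand-ins (assumptions for illustration only): each "view" maps an object
# name to the indices of the Gaussians it covers in a 4-Gaussian scene.
views = [{"bear": {0, 1}, "grass": {2, 3}}, {"bear": {1}, "grass": {3}}]
toy = Stages(
    describe=lambda vs: ", ".join(sorted({name for v in vs for name in v})),
    llm=lambda desc, t: next(name for name in desc.split(", ") if name in t),
    segment=lambda v, roi: v.get(roi, set()),
    lift=lambda rois: [any(i in r for r in rois) for i in range(4)],
)
mask = gaussian_roi(views, "turn the bear into a panda", toy)
```

With these toy stand-ins, `mask` flags Gaussians 0 and 1 (the "bear") and leaves 2 and 3 (the "grass") outside the RoI. In the real system each stage would be a learned model rather than a dictionary lookup.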