3DSceneEditor: Controllable 3D Scene Editing with Gaussian Splatting

Ziyang Yan, Lei Li, Yihua Shao, Siyu Chen, Zongkai Wu, Jenq-Neng Hwang, Hao Zhao, Fabio Remondino

TL;DR

3DSceneEditor proposes a fully 3D Gaussian-based editing framework for complex scenes, enabling text-guided, real-time edits by directly manipulating Gaussians. It combines Mask3D semantic labeling, a CLIP-based open-vocabulary grounding module, and Gaussian-centric edits (add, remove, move, recolor, replace) within a region of interest (ROI), formalized as $G_{out} = \mathrm{Edit}(G_{in}, \tau)$. Compared with state-of-the-art diffusion- and 2D-projection–based methods, it delivers higher editing quality (CTIS/CIIS), faster turnaround (initial edits in 2–5 minutes, secondary edits in under 1 minute), and lower GPU memory usage on indoor ScanNet++ scenes. This 3D-only paradigm advances interactive 3D content creation by leveraging explicit Gaussian representations for fine-grained, semantically aware modifications.
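
As a rough illustration of what "directly manipulating Gaussians" within an ROI can mean, the sketch below applies remove, move, and recolor edits to a toy scene. It is not the authors' implementation: the `Gaussian` record, the `instance_id` field, and the `edit` function are hypothetical names chosen for this example, and a real splat also carries scale, rotation, opacity, and spherical-harmonic colour coefficients that are omitted here.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Gaussian:
    """Toy stand-in for one 3D Gaussian (assumed layout, not the paper's)."""
    mean: Vec3        # centre position
    color: Vec3       # plain RGB in [0, 1]; real splats use SH coefficients
    instance_id: int  # hypothetical per-Gaussian label from segmentation


def edit(scene: List[Gaussian], target_id: int, op: str, **kw) -> List[Gaussian]:
    """Apply one edit to the Gaussians labelled target_id (the ROI)."""
    out = []
    for g in scene:
        if g.instance_id != target_id:
            out.append(g)  # outside the ROI: left untouched
        elif op == "remove":
            continue       # drop every Gaussian inside the ROI
        elif op == "move":
            dx, dy, dz = kw["offset"]
            x, y, z = g.mean
            out.append(replace(g, mean=(x + dx, y + dy, z + dz)))
        elif op == "recolor":
            out.append(replace(g, color=kw["color"]))
        else:
            raise ValueError(f"unsupported op: {op}")
    return out


scene = [
    Gaussian((0.0, 0.0, 0.0), (0.5, 0.5, 0.5), instance_id=1),  # e.g. a table
    Gaussian((1.0, 0.0, 0.0), (0.2, 0.2, 0.2), instance_id=2),  # e.g. a stool
    Gaussian((1.1, 0.0, 0.1), (0.2, 0.2, 0.2), instance_id=2),
]

removed = edit(scene, target_id=2, op="remove")
moved = edit(scene, target_id=2, op="move", offset=(0.5, 0.0, 0.0))
pink = edit(scene, target_id=2, op="recolor", color=(1.0, 0.75, 0.8))
```

Because each edit only touches Gaussians whose label matches the target, everything outside the ROI is passed through unchanged, which is the property the explicit representation makes cheap to guarantee.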

Abstract

The creation of 3D scenes has traditionally been both labor-intensive and costly, requiring designers to meticulously configure 3D assets and environments. Recent advancements in generative AI, including text-to-3D and image-to-3D methods, have dramatically reduced the complexity and cost of this process. However, current techniques for editing complex 3D scenes continue to rely on generally interactive, multi-step 2D-to-3D projection methods and diffusion-based techniques, which often lack precision in control and hamper real-time performance. In this work, we propose 3DSceneEditor, a fully 3D-based paradigm for real-time, precise editing of intricate 3D scenes using Gaussian Splatting. Unlike conventional methods, 3DSceneEditor operates through a streamlined 3D pipeline, enabling direct manipulation of Gaussians for efficient, high-quality edits based on input prompts. The proposed framework (i) integrates a pre-trained instance segmentation model for semantic labeling; (ii) employs a zero-shot grounding approach with CLIP to align target objects with user prompts; and (iii) applies scene modifications, such as object addition, repositioning, recoloring, replacement, and deletion, directly on Gaussians. Extensive experimental results show that 3DSceneEditor achieves superior editing precision and speed with respect to current SOTA 3D scene editing approaches, establishing a new benchmark for efficient and interactive 3D scene customization.

Paper Structure

This paper contains 19 sections, 1 equation, 11 figures, 3 tables.

Figures (11)

  • Figure 1: We present 3DSceneEditor, an interactive, 3D-only framework designed for precise editing of complex 3D scenes based on natural language instructions. As shown in the figure, our method allows for fine-grained control over specific editing regions, enabling targeted modifications to the scene. 3DSceneEditor supports a wide variety of edits, including object removal, addition, recoloring, repositioning, and replacement. In this example, the system responds to different prompts to perform specific actions, such as removing the stool, changing its color to pink, moving it farther from the table, adding a plate of fruit, and replacing the stool with a wooden chair. This flexibility demonstrates the power of 3DSceneEditor in transforming scene elements while maintaining realism and spatial consistency.
  • Figure 2: Our paradigm, named 3DSceneEditor, consists of three key steps. First, a pre-trained instance segmentation model is applied to understand the input scene and assign a semantic label to each Gaussian. Next, an Open Vocabulary Object Grounding module grounds the target objects in the input semantic Gaussians and generates the ROI for them. Finally, we execute the specified scene editing operation in the ROI based on the prompt and render the edited views.
  • Figure 3: Visualization of our Object Grounding. We first extract the keywords from the prompt (shown in bold in the figure). Since the positional relationships between objects in 3D space change across viewpoints, we project the objects onto a static 2D plane to better understand the scene.
  • Figure 4: Visualization of our object addition pipeline. We generate new objects using a Gaussian-based generative model, guided by keywords extracted from the prompt. With the assistance of the Object Grounding module, these new Gaussians are then integrated into the ROI within the input scene.
  • Figure 5: Extensive Results of 3DSceneEditor. This figure presents additional results across diverse scenes, demonstrating that our method enables precise and varied scene editing for layouts and objects of different scales.
  • ...and 6 more figures
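
The grounding step described in the Figure 2 and Figure 3 captions can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's module: the embeddings below are hand-written toy vectors standing in for CLIP features, the functions `cosine`, `ground`, `project_top_down`, and `roi_2d` are hypothetical helpers, and the choice of a top-down projection with z as the height axis is an assumption about what "a static 2D plane" means.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point3 = Tuple[float, float, float]
Point2 = Tuple[float, float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ground(query_emb: Sequence[float],
           instance_embs: Dict[int, Sequence[float]]) -> int:
    """Return the instance id whose embedding best matches the query."""
    return max(instance_embs, key=lambda i: cosine(query_emb, instance_embs[i]))


def project_top_down(points: List[Point3]) -> List[Point2]:
    """Drop the height axis (assumed to be z) to get a view-independent layout."""
    return [(x, y) for x, y, _ in points]


def roi_2d(points: List[Point2]) -> Tuple[float, float, float, float]:
    """Axis-aligned bounding box (min_x, min_y, max_x, max_y) of the target."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


# Toy stand-ins for per-instance and prompt-keyword embeddings.
instance_embs = {1: [1.0, 0.0, 0.0],   # e.g. "table"
                 2: [0.0, 1.0, 0.1]}   # e.g. "stool"
query_emb = [0.1, 0.9, 0.0]            # keyword extracted from the prompt

target = ground(query_emb, instance_embs)

# Gaussian centres belonging to the grounded instance (toy values).
target_points = [(1.0, 0.0, 0.0), (1.1, 0.2, 0.4)]
roi = roi_2d(project_top_down(target_points))
```

In a real pipeline the toy vectors would be replaced by text and image or label features from a CLIP encoder, and the ROI would more likely be a 3D region; the 2D box here only shows why a fixed projection gives spatial relations that do not depend on the camera.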