GaussEdit: Adaptive 3D Scene Editing with Text and Image Prompts

Zhenyu Shu, Junlong Yu, Kai Chao, Shiqing Xin, Ligang Liu

TL;DR

GaussEdit introduces a three‑stage framework for adaptive 3D scene editing driven by text and image prompts, anchored in 3D Gaussian Splatting. It combines fast ROI‑based Gaussian initialization, an Adaptive Global‑Local Optimization loop with category‑guided regularization to mitigate the Janus problem, and a texture refinement stage using image‑to‑image diffusion to achieve realistic, prompt‑aligned edits. Empirical results show superior editing accuracy, visual fidelity, and speed compared with prior work, across diverse real‑world and synthetic scenes. This approach enables precise, multi‑view‑consistent manipulation of 3D scenes while maintaining coherence with the surrounding environment, offering a practical tool for content creation and modeling workflows.

Abstract

This paper presents GaussEdit, a framework for adaptive 3D scene editing guided by text and image prompts. GaussEdit leverages 3D Gaussian Splatting as its backbone for scene representation, enabling convenient Region of Interest selection and efficient editing through a three-stage process. The first stage initializes the 3D Gaussians to ensure high-quality edits. The second stage employs an Adaptive Global-Local Optimization strategy to balance global scene coherence against detailed local edits, together with a category-guided regularization technique to alleviate the Janus problem. The final stage enhances the texture of the edited objects using an image-to-image synthesis technique, so that the results are visually realistic and align closely with the given prompts. Our experimental results demonstrate that GaussEdit surpasses existing methods in editing accuracy, visual fidelity, and processing speed. By embedding user-specified concepts into 3D scenes, GaussEdit serves as a tool for detailed, user-driven 3D scene editing.
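
As an illustration of the first stage, the Figure 2 caption describes it as setting a bounding box to define the Region of Interest, then downsampling and resetting the Gaussian parameters of the points inside it. The sketch below is one possible reading of those steps in plain Python, not the authors' implementation. The `Gaussian` record, the `init_roi` helper, the axis-aligned box, the random downsampling, the `keep_ratio` parameter, and the reset values are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass, replace
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Gaussian:
    """Hypothetical, heavily simplified stand-in for one 3D Gaussian."""
    position: Vec3
    scale: float
    opacity: float
    color: Vec3


def in_box(p: Vec3, box_min: Vec3, box_max: Vec3) -> bool:
    """True if point p lies inside the axis-aligned bounding box."""
    return all(lo <= c <= hi for c, lo, hi in zip(p, box_min, box_max))


def init_roi(
    gaussians: List[Gaussian],
    box_min: Vec3,
    box_max: Vec3,
    keep_ratio: float = 0.5,
    seed: int = 0,
) -> Tuple[List[Gaussian], List[Gaussian]]:
    """Split a scene into untouched background and a reset editing region."""
    roi = [g for g in gaussians if in_box(g.position, box_min, box_max)]
    background = [g for g in gaussians if not in_box(g.position, box_min, box_max)]

    # Downsample the ROI by random subsampling (assumed strategy).
    rng = random.Random(seed)
    n_keep = min(len(roi), max(1, round(len(roi) * keep_ratio))) if roi else 0
    kept = rng.sample(roi, n_keep)

    # Reset appearance parameters; these values are placeholders, not the
    # paper's settings. Positions are kept so the edit starts from scene geometry.
    edited = [replace(g, scale=0.01, opacity=0.1, color=(0.5, 0.5, 0.5)) for g in kept]
    return background, edited
```

In this reading only the `edited` list would be optimized in the later stages, while `background` is left unchanged so that the surrounding scene stays intact.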

Paper Structure

This paper contains 18 sections, 11 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: The edited results using our approach, GaussEdit. The first column shows the original 3D Gaussian scenes. The second column presents the scenes edited using text prompts, while the third column displays the results after editing with image prompts, utilizing reference images of specific objects. These examples highlight the effectiveness of GaussEdit in embedding user-specified concepts into 3D scenes with high fidelity and visual realism.
  • Figure 2: GaussEdit edits a 3D scene from textual descriptions or reference images through a streamlined pipeline. We first extract point clouds and set bounding boxes to define the Region of Interest, then downsample the points in the editing region and reset their Gaussian parameters. When a reference image is given, we use the Custom Diffusion method to learn the user-specified concept and generate a special token "V*". Next, we apply an Adaptive Global-Local Optimization strategy, selecting the input image and its corresponding prompt. This is followed by a category-guided regularization technique, in which parameters are alternately optimized using Stable Diffusion and MVDream. When the prompt is processed by MVDream, the placeholder "OBJECT" in the prompt is replaced with its corresponding "CATEGORY". The preliminary editing results often contain noise, so we refine the object's texture using image-to-image synthesis: with the T2I diffusion model, we add noise to the rendered images and then denoise them so that the final images align with the editing requirements.
  • Figure 3: Visual results of our proposed GaussEdit. The first column shows the original scene, while the next three display the edited results. Each scene is guided by text and image prompts, with the reference image placed in the bottom left corner of the edited results.
  • Figure 4: Visual results of our proposed GaussEdit. The first column shows the original scene, while the next three display the edited results. Each scene is guided by text and image prompts, with the reference image placed in the bottom left corner of the edited results.
  • Figure 5: Qualitative comparison on the image-driven editing setting.
  • ...and 5 more figures
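
The category-guided regularization described in the Figure 2 caption can be sketched as a schedule that alternates between the two guidance models and rewrites the prompt for MVDream. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the strict even/odd alternation, the prompt template, and the idea that Stable Diffusion receives an object phrase containing the "V*" token are all hypothetical. Only the alternation between Stable Diffusion and MVDream and the replacement of "OBJECT" with the category come from the caption.

```python
from typing import List, Tuple


def guidance_for_step(step: int) -> str:
    """Pick the guidance model for a step (assumed: strict even/odd alternation)."""
    return "stable_diffusion" if step % 2 == 0 else "mvdream"


def prompt_for(guidance: str, template: str, object_phrase: str, category: str) -> str:
    """Fill the OBJECT placeholder; MVDream sees only the generic category."""
    filler = category if guidance == "mvdream" else object_phrase
    return template.replace("OBJECT", filler)


def build_schedule(
    num_steps: int, template: str, object_phrase: str, category: str
) -> List[Tuple[str, str]]:
    """Return (guidance model, prompt) pairs for each optimization step."""
    schedule = []
    for step in range(num_steps):
        guidance = guidance_for_step(step)
        schedule.append((guidance, prompt_for(guidance, template, object_phrase, category)))
    return schedule


if __name__ == "__main__":
    for guidance, prompt in build_schedule(4, "a photo of OBJECT on the table", "V* dog", "dog"):
        print(f"{guidance}: {prompt}")
```

One way to read the design is that the multi-view model is given a generic category prompt so that it regularizes overall 3D shape, which is the role the caption assigns to it in alleviating the Janus problem, while the personalized prompt is reserved for the model that learned the "V*" concept.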