GaussCtrl: Multi-View Consistent Text-Driven 3D Gaussian Splatting Editing

Jing Wu, Jia-Wang Bian, Xinghui Li, Guangrun Wang, Ian Reid, Philip Torr, Victor Adrian Prisacariu

TL;DR

GaussCtrl tackles text-driven editing of 3D Gaussian Splatting (3DGS) scenes by enforcing multi-view consistency in two ways: depth-conditioned image editing via ControlNet, and attention-based latent code alignment against a set of reference views. The method renders views from the 3DGS, inverts them to latent codes with DDIM, edits them under the modified prompt, and then updates the 3D model once, rather than alternating between editing single images and updating the model. Depth guidance keeps geometry consistent across views, while cross-view alignment keeps appearance consistent, which makes high-quality edits possible even on challenging 360-degree scenes. Compared to IN2N(GS) and ViCA-NeRF, GaussCtrl delivers sharper, more consistent results with fewer artefacts and shorter processing times. The approach combines a differentiable 3D Gaussian representation with diffusion-based 2D editing, so that edits made in image space are carried into the 3D model.
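The DDIM inversion step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy noise schedule `alphas` (cumulative alpha-bar values), and the constant noise predictor `eps_fn` are all assumptions made for illustration. A real pipeline would call a diffusion U-Net conditioned on the prompt and, in GaussCtrl, on depth.

```python
import math

def ddim_step(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update of latent x between two noise levels.

    alpha_from / alpha_to are cumulative alpha-bar values (1.0 = clean).
    The same formula is used for inversion (towards noise) and sampling.
    """
    out = []
    for xi, ei in zip(x, eps):
        # Predict the clean latent implied by the current noise estimate.
        x0 = (xi - math.sqrt(1.0 - alpha_from) * ei) / math.sqrt(alpha_from)
        # Re-noise the prediction to the target noise level.
        out.append(math.sqrt(alpha_to) * x0 + math.sqrt(1.0 - alpha_to) * ei)
    return out

def ddim_invert(x0, alphas, eps_fn):
    """Map a clean latent to a noisy one; alphas run from clean to noisy."""
    x = list(x0)
    for a_from, a_to in zip(alphas[:-1], alphas[1:]):
        x = ddim_step(x, eps_fn(x, a_from), a_from, a_to)
    return x

def ddim_sample(x_noisy, alphas, eps_fn):
    """Run the same schedule in reverse, from noisy back to clean."""
    x = list(x_noisy)
    rev = alphas[::-1]
    for a_from, a_to in zip(rev[:-1], rev[1:]):
        x = ddim_step(x, eps_fn(x, a_from), a_from, a_to)
    return x

# Hypothetical stand-in for a noise-prediction network (assumption).
def toy_eps(x, alpha):
    return [0.5, -0.25]

alphas = [1.0, 0.9, 0.6, 0.3]   # toy schedule, not from the paper
latent = [1.0, -2.0]
noisy = ddim_invert(latent, alphas, toy_eps)
recon = ddim_sample(noisy, alphas, toy_eps)
```

With the constant toy predictor, sampling recovers the original latent up to floating-point error. With a real network the round trip is only approximate, and editing comes from sampling under a different prompt than the one used for inversion.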

Abstract

We propose GaussCtrl, a text-driven method to edit a 3D scene reconstructed by 3D Gaussian Splatting (3DGS). Our method first renders a collection of images using the 3DGS and edits them with a pre-trained 2D diffusion model (ControlNet) based on the input prompt; the edited images are then used to optimise the 3D model. Our key contribution is multi-view consistent editing, which enables editing all images together instead of iteratively editing one image while updating the 3D model, as in previous works. This leads to faster editing as well as higher visual quality. It is achieved by two components: (a) depth-conditioned editing, which enforces geometric consistency across multi-view images by leveraging naturally consistent depth maps; and (b) attention-based latent code alignment, which unifies the appearance of edited images by conditioning their editing on several reference views through self- and cross-view attention between the images' latent representations. Experiments demonstrate that our method achieves faster editing and better visual results than previous state-of-the-art methods.
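Component (b) can be sketched as a blend of ordinary self-attention with attention from the same queries to the keys and values of reference views. The abstract does not give the exact formulation, so the following is one plausible reading: the function names, the averaging over reference views, and the blending weight `lam` are assumptions rather than the authors' definition.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention over lists of vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        w = softmax(scores)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v))
                    for c in range(len(v[0]))])
    return out

def aligned_attention(q, k, v, refs, lam=0.5):
    """Blend self-attention with mean cross-view attention (assumed form).

    refs is a list of (k_ref, v_ref) pairs, one per reference view.
    lam is a hypothetical weight: 1.0 keeps pure self-attention,
    0.0 uses only the reference views.
    """
    self_out = attention(q, k, v)
    if not refs:
        return self_out
    cross = [attention(q, k_ref, v_ref) for k_ref, v_ref in refs]
    n = len(refs)
    return [
        [lam * row[c] + (1.0 - lam) * sum(cr[i][c] for cr in cross) / n
         for c in range(len(row))]
        for i, row in enumerate(self_out)
    ]

# Toy latents for one edited view and one reference view.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
ref = ([[0.5, 0.5]], [[7.0, 7.0]])

only_self = aligned_attention(q, k, v, [ref], lam=1.0)
only_ref = aligned_attention(q, k, v, [ref], lam=0.0)
```

Because every edited view attends to the same reference keys and values, the reference appearance is pulled into each view's features, which is the intuition behind unifying appearance across views.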

Paper Structure

This paper contains 16 sections, 5 equations, 9 figures, and 1 table.

Figures (9)

  • Figure 1: GaussCtrl. Our method edits a 3D Gaussian Splatting (3DGS) scene by modifying its descriptive prompt (Upper Left). This is achieved by editing the rendered images of the 3DGS and re-training the 3D model (Upper Right). Our contribution is a depth-conditioned multi-view consistent editing framework, which substantially improves on the blurry or unreasonable 3D results caused by inconsistent editing in previous work (Bottom).
  • Figure 2: GaussCtrl pipeline. Given a 3DGS scene and text instructions, our method renders images using the 3DGS and edits the rendered images with text instructions, which are then used to optimise the original 3DGS. Our key contribution is multi-view consistent editing. Towards this, we propose (1) depth-conditioned editing based on ControlNet for geometry consistency; and (2) attention-based latent code alignment for improving consistency during editing.
  • Figure 3: Qualitative results. We show diverse results of text-guided editing in various scenes, ranging from editing objects to adjusting environments, e.g., changing the appearance and age of the target human, and modifying the environment.
  • Figure 4: Qualitative comparison on 360-degree scenes. Our method generates more consistent and higher-quality images than previous state-of-the-art methods.
  • Figure 5: Qualitative results on forward-facing scenes. Our method generates more realistic results with better quality, better consistency, and fewer artefacts.
  • ...and 4 more figures