GaussCtrl: Multi-View Consistent Text-Driven 3D Gaussian Splatting Editing
Jing Wu, Jia-Wang Bian, Xinghui Li, Guangrun Wang, Ian Reid, Philip Torr, Victor Adrian Prisacariu
TL;DR
GaussCtrl tackles text-driven editing of 3D Gaussian Splatting (3DGS) scenes by enforcing multi-view consistency through depth-conditioned image editing with ControlNet and attention-based latent code alignment across reference views. The method renders views from the 3DGS model, inverts them to latent codes with DDIM, edits them according to the editing prompt, and then updates the 3D model once, which makes editing faster and more coherent than in prior work. Its two core contributions, depth-guided geometric consistency and cross-view appearance alignment, enable high-quality edits across challenging 360-degree views. Compared to IN2N(GS) and ViCA-NeRF, GaussCtrl delivers sharper, more consistent results with fewer artefacts and shorter processing times. The approach combines a differentiable 3D Gaussian representation with diffusion-based 2D editing, so that edits made in image space are carried over to the 3D model.
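The DDIM inversion step mentioned above can be illustrated with a minimal NumPy sketch. It assumes the standard deterministic DDIM update and a noise schedule passed in as `alphas_cumprod`. The names `ddim_invert`, `ddim_sample`, and `eps_model` are hypothetical stand-ins for illustration and are not taken from the GaussCtrl code.

```python
import numpy as np

def ddim_invert(x0, eps_model, alphas_cumprod):
    """Map a clean latent to a noisy latent code with deterministic DDIM steps.

    alphas_cumprod[t] is the cumulative schedule product at step t and is
    assumed to decrease with t. eps_model(x, t) is a hypothetical noise predictor.
    """
    x = x0.copy()
    for t in range(len(alphas_cumprod) - 1):
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t + 1]
        eps = eps_model(x, t)
        # Predict the clean latent, then re-noise it to the next (noisier) step.
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps
    return x

def ddim_sample(xT, eps_model, alphas_cumprod):
    """Run the same deterministic update in the denoising direction."""
    x = xT.copy()
    for t in range(len(alphas_cumprod) - 1, 0, -1):
        a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
        eps = eps_model(x, t)
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return x
```

With a real denoising network the round trip is only approximate, because the noise prediction is evaluated at slightly different latents in the two directions. Starting the edit from inverted latent codes rather than from fresh random noise is what lets the edited image keep the layout of the rendered view.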
Abstract
We propose GaussCtrl, a text-driven method to edit a 3D scene reconstructed by 3D Gaussian Splatting (3DGS). Our method first renders a collection of images from the 3DGS model and edits them with a pre-trained 2D diffusion model (ControlNet) based on the input prompt; the edited images are then used to optimise the 3D model. Our key contribution is multi-view consistent editing, which enables editing all images together instead of iteratively editing one image while updating the 3D model, as in previous works. This leads to faster editing as well as higher visual quality. It is achieved by two components: (a) depth-conditioned editing, which enforces geometric consistency across multi-view images by leveraging naturally consistent depth maps; and (b) attention-based latent code alignment, which unifies the appearance of the edited images by conditioning their editing on several reference views through self and cross-view attention between the images' latent representations. Experiments demonstrate that our method achieves faster editing and better visual results than previous state-of-the-art methods.
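The attention-based latent code alignment in (b) can be sketched as a blend of ordinary self-attention with cross-view attention to the reference views. The sketch below is an illustration under assumptions, not the authors' implementation: the abstract only states that self and cross-view attention are used, so the blend weight `weight`, the averaging over reference views, and the function names are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention for one view: q is (n, d), k and v are (m, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def aligned_attention(q, k, v, ref_kv, weight=0.5):
    """Blend self-attention with attention to reference views.

    ref_kv is a list of (k_ref, v_ref) pairs, one per reference view.
    weight is an assumed blend factor: 1.0 means self-attention only.
    """
    self_out = attention(q, k, v)
    # The current view's queries attend to each reference view's keys and values.
    cross_out = np.mean([attention(q, k_r, v_r) for k_r, v_r in ref_kv], axis=0)
    return weight * self_out + (1.0 - weight) * cross_out
```

Because every edited view attends to the same reference views, all views draw their appearance from a shared source, which is the intuition behind the unified appearance described in (b).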
