Drag Your Gaussian: Effective Drag-Based Editing with Score Distillation for 3D Gaussian Splatting

Yansong Qu, Dian Chen, Xinyang Li, Xiaofan Li, Shengchuan Zhang, Liujuan Cao, Rongrong Ji

TL;DR

This work tackles the challenge of geometric editing in 3D Gaussian Splatting (3DGS) by introducing Drag Your Gaussian (DYG), a drag-based method that uses 3D masks and control-point prompts to steer edits. It couples an implicit Multi-resolution Triplane Positional Encoder with a Region-Specific Positional Decoder and a Soft Local Edit strategy to overcome sparse Gaussian distributions and ensure localized, smooth geometry changes. Guided by an enhanced Drag-SDS loss that fuses a 2D Latent Diffusion Model with a composite noise prediction, DYG achieves multi-view consistent, fine-grained edits while preserving non-edited regions. Extensive experiments on real and generative scenes demonstrate state-of-the-art editing quality, including multi-round dragging and cross-domain generalization, with practical runtime efficiency relative to prior 3DGS editing methods. Limitations stem from the dependence on 2D priors, suggesting future work to speed up interaction and extend to dynamic 4D editing.

Abstract

Recent advancements in 3D scene editing have been propelled by the rapid development of generative models. Existing methods typically use generative models to perform text-guided editing on 3D representations, such as 3D Gaussian Splatting (3DGS). However, these methods are often limited to texture modifications and fail at geometric changes, such as turning a character's head around. Moreover, such methods lack accurate control over the spatial position of editing results, as language struggles to precisely describe the extent of edits. To overcome these limitations, we introduce DYG, an effective 3D drag-based editing method for 3D Gaussian Splatting. Users specify the desired editing region and dragging direction through 3D masks and pairs of control points, which gives precise control over the extent of editing. DYG uses an implicit triplane representation to establish the geometric scaffold of the editing results, overcoming the suboptimal editing outcomes caused by the sparsity of 3DGS in the desired editing regions. Additionally, we incorporate a drag-based Latent Diffusion Model into our method through the proposed Drag-SDS loss function, enabling flexible, multi-view consistent, and fine-grained editing. Extensive experiments demonstrate that DYG conducts effective drag-based editing guided by control point prompts, surpassing other baselines in terms of editing effect and quality, both qualitatively and quantitatively. Visit our project page at https://quyans.github.io/Drag-Your-Gaussian.
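
For background, the Drag-SDS loss named in the abstract builds on Score Distillation Sampling (SDS). The block below gives only the standard SDS gradient in its latent-diffusion form, not the paper's Drag-SDS formulation, which replaces the plain noise prediction with a composite one. All symbols here are assumptions introduced for illustration: $\theta$ denotes the parameters of the 3D representation, $z$ the latent of a rendered view, $z_t$ its noised version at timestep $t$, $y$ the conditioning, $\epsilon_\phi$ the pretrained noise predictor, and $w(t)$ a timestep weighting.

```latex
\nabla_\theta \mathcal{L}_{\text{SDS}}
  = \mathbb{E}_{t,\,\epsilon}\!\left[
      w(t)\,\bigl(\epsilon_\phi(z_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial z}{\partial \theta}
    \right],
\qquad
z_t = \sqrt{\bar{\alpha}_t}\, z + \sqrt{1 - \bar{\alpha}_t}\,\epsilon,
\quad \epsilon \sim \mathcal{N}(0, I)
```

The residual $\epsilon_\phi(z_t; y, t) - \epsilon$ acts as a per-view update direction that is backpropagated through the renderer to the 3D parameters, which is how a 2D diffusion prior can guide a 3D edit.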

Paper Structure

This paper contains 30 sections, 16 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: Differences between our drag-based editing approach and the text-guided editing method GS-Editor [chen2024gaussianeditor]. The latter often fails to achieve geometric editing goals and struggles to describe the degree of editing through text, whereas our method allows for flexible control over the extent of edits.
  • Figure 2: The overall framework of DYG. Left: Given a 3D Gaussian scene, users provide 3D masks and several pairs of control points as input. Top-right: The Smooth Geometric Editing module predicts positional offsets for 3D Gaussians, resolving the issue of sparse distributions within the target region while ensuring seamless local editing. We adopt a two-stage training strategy: the first stage constructs the geometric scaffold of the edited Gaussians, and the second stage refines the texture details. Bottom-right: In the Score Distillation Guidance Module, to ensure stable optimization, 3D control points are projected onto 2D control points for a specified viewpoint. The RGB image and 2D mask, rendered from the mirrored initial 3D Gaussians, are encoded into point embeddings (P-Emb) and appearance embeddings (A-Emb), which act as conditions for the drag-based LDM. This process leverages our proposed Drag-SDS loss function to enable flexible and view-consistent 3D drag-based editing.
  • Figure 3: Detailed illustration of the Score Distillation Guidance Module and Drag-SDS loss, presented in Fig. 2. We employ two different UNets to predict $\epsilon_{\text{tgt}}$ and $\epsilon_{\text{src}}$, respectively, as defined in the paper's composite noise-prediction equation. The components within the orange box represent the inputs to the Inpainting UNet, while the components within the green box signify the inputs to the Original SD UNet.
  • Figure 4: Qualitative comparison between DYG and different baselines. The first column shows two rendered views of the original 3D scene, where the 3D editing points are projected onto the 2D plane for visualization. SC-GS [huang2024scgs] may show unnatural results, as well as blurring or tearing of the background, while GS-Editor [chen2024gaussianeditor] and GS-Ctrl [wu2025gaussctrl] frequently fail to perform successful edits. Additionally, GS-Ctrl tends to exhibit over-saturation issues, and 2D-Lifting suffers from scene blurriness. By contrast, DYG interprets both the user’s dragging intent and the 3D scene context, thereby achieving effective editing and generating detailed results across various scenarios, including deformation, transformation, and morphing.
  • Figure 5: More qualitative results. The top three rows showcase real scenes, while the bottom two rows are generated scenes. For each edit, we show two views of both the original (OS) and edited scenes (ES).
  • ...and 8 more figures
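
The Figure 2 caption mentions that 3D control points are projected onto 2D control points for a specified viewpoint. A minimal sketch of such a projection is given below, assuming a standard pinhole camera with world-to-camera rotation `rotation`, translation `translation`, and intrinsics `fx`, `fy`, `cx`, `cy`. The function names and camera convention are hypothetical and chosen for illustration; they are not taken from the paper or its code.

```python
from typing import List, Optional, Sequence, Tuple

Point2D = Tuple[float, float]
Point3D = Sequence[float]


def project_point(
    point_world: Point3D,
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> Optional[Point2D]:
    """Project one 3D world point to pixel coordinates (assumed pinhole model).

    Returns None when the point lies on or behind the camera plane.
    """
    # World -> camera frame: X_cam = R @ X_world + t
    x, y, z = (
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if z <= 1e-8:
        return None
    # Perspective divide followed by the intrinsic mapping to pixels
    return (fx * x / z + cx, fy * y / z + cy)


def project_control_pairs(
    pairs: Sequence[Tuple[Point3D, Point3D]],
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Tuple[Point2D, Point2D]]:
    """Project (handle, target) 3D control-point pairs into one view.

    Pairs with either endpoint behind the camera are dropped.
    """
    projected = []
    for handle, target in pairs:
        h2d = project_point(handle, rotation, translation, fx, fy, cx, cy)
        t2d = project_point(target, rotation, translation, fx, fy, cx, cy)
        if h2d is not None and t2d is not None:
            projected.append((h2d, t2d))
    return projected


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    pairs = [((0.0, 0.0, 2.0), (0.5, 0.0, 2.0))]
    print(project_control_pairs(pairs, identity, [0.0, 0.0, 0.0],
                                100.0, 100.0, 50.0, 50.0))
```

Projecting each pair per view gives 2D handle and target points that a 2D drag-based model can be conditioned on, while the pairs themselves stay defined once in 3D.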