Variation-aware Flexible 3D Gaussian Editing

Hao Qin, Yukai Sun, Meng Wang, Ming Kong, Mengxu Lu, Qiang Zhu

TL;DR

The paper tackles cross-view inconsistencies and limited flexibility in editing 3D Gaussians by proposing VF-Editor, a native 3D editing framework that uses a variation predictor to estimate per-primitive attribute variations. It distills multi-source 2D editing priors into a unified model composed of a Random Tokenizer, a Variation Field Generation Module, and Iterative Parallel Decoding Functions. The model predicts variations $\boldsymbol{\Delta} = \{\delta_{\mu}, \delta_{s}, \delta_{\alpha}, \delta_{c}, \delta_{r}\}$ in real time and produces the edited output $\mathcal{X}^{r} = \mathcal{X}^{s} + \boldsymbol{\Delta}$. Key contributions include the variation field approach, linear-time parallel decoding, and multi-domain knowledge distillation using DDIM, diffusion inversion, and SDS-based strategies. The method is validated on public and private data, where it improves Aesthetic and Consistency metrics and supports diverse edits. It enables flexible, open-vocabulary 3D edits in real time, generalizes well to unseen data, and yields variations that are easy to interpret, adjust, and compose across scenes and instructions.
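The overlay step $\mathcal{X}^{r} = \mathcal{X}^{s} + \boldsymbol{\Delta}$ can be illustrated with a minimal sketch. It is not the authors' implementation: the scene layout (a dictionary of per-Gaussian rows), the attribute dimensions, and the name `apply_variation` are assumptions made for illustration, and the variation is hand-written here rather than produced by the predictor. The summary does not say whether any attribute is clamped or re-normalized after the update, so the sketch performs plain element-wise addition only.

```python
from typing import Dict, List

# Attribute keys mirror the TL;DR: mean, scale, opacity, color, rotation.
# The per-attribute dimensions used below are illustrative assumptions.
ATTRS = ("mu", "s", "alpha", "c", "r")

Scene = Dict[str, List[List[float]]]  # attribute name -> one row per Gaussian


def apply_variation(source: Scene, delta: Scene) -> Scene:
    """Return X^r = X^s + Delta, added element-wise for every primitive."""
    edited: Scene = {}
    for name in ATTRS:
        rows_s, rows_d = source[name], delta[name]
        if len(rows_s) != len(rows_d):
            raise ValueError(f"primitive count mismatch for {name!r}")
        out = []
        for row_s, row_d in zip(rows_s, rows_d):
            if len(row_s) != len(row_d):
                raise ValueError(f"dimension mismatch for {name!r}")
            out.append([a + b for a, b in zip(row_s, row_d)])
        edited[name] = out
    return edited


# Two toy Gaussians (hypothetical values).
source = {
    "mu": [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    "s": [[0.5, 0.5, 0.5], [0.25, 0.25, 0.25]],
    "alpha": [[0.75], [0.5]],
    "c": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "r": [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]],
}

# Hand-written variation: recolor the first Gaussian, shift the second one.
delta = {
    "mu": [[0.0, 0.0, 0.0], [0.0, 0.5, 0.0]],
    "s": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    "alpha": [[0.0], [0.0]],
    "c": [[-0.5, 0.0, 0.5], [0.0, 0.0, 0.0]],
    "r": [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
}

edited = apply_variation(source, delta)
```

Because the edit is a separate additive term in this sketch, the source scene is left untouched and the variation can be inspected on its own.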

Abstract

Indirect editing methods for 3D Gaussian Splatting (3DGS) have recently witnessed significant advancements. These approaches operate by first applying edits in the rendered 2D space and subsequently projecting the modifications back into 3D. However, this paradigm inevitably introduces cross-view inconsistencies and constrains both the flexibility and efficiency of the editing process. To address these challenges, we present VF-Editor, which enables native editing of Gaussian primitives by predicting attribute variations in a feedforward manner. To accurately and efficiently estimate these variations, we design a novel variation predictor distilled from 2D editing knowledge. The predictor encodes the input to generate a variation field and employs two learnable, parallel decoding functions to iteratively infer attribute changes for each 3D Gaussian. Thanks to its unified design, VF-Editor can seamlessly distill editing knowledge from diverse 2D editors and strategies into a single predictor, allowing for flexible and effective knowledge transfer into the 3D domain. Extensive experiments on both public and private datasets reveal the inherent limitations of indirect editing pipelines and validate the effectiveness and flexibility of our approach.

Paper Structure (42 sections, 16 equations, 17 figures, 11 tables)

Figures (17)

  • Figure 1: VF-Editor is a native editing method for 3D Gaussian Splatting across multiple scenes and instructions. In the top-left corner, we present a 2D visualization of the 3D variation within VF-Editor; please refer to the appendix for the specific visualization rules.
  • Figure 2: Schematic of VF-Editor. Given a 3D scene $\mathcal{X}^{s}$ and an editing instruction $y$, the variation predictor $\mathcal{P}_{\theta}$ generates variations which, when overlaid on the input scene $\mathcal{X}^{s}$, produce the edited result $\mathcal{X}^{r}$. VF-Editor trains $\mathcal{P}_{\theta}$ by distilling multi-source visual editing knowledge.
  • Figure 3: Qualitative comparison. VF-Editor achieves desired 3D editing with maximal preservation of original information. For video results, please see Demo.mp4 in the supplementary materials.
  • Figure 4: Visualization of the ablation study of iterative decoding. Direct decoding impairs the model's predictive capability regarding the positional changes of the 3D Gaussian.
  • Figure 5: Visualization of the ablation study of parallel decoding function. (Left) The triplane decoding strategy used for ablation. (Right) Display of the reference image and editing results.
  • ...and 12 more figures