TIGER: Text-Instructed 3D Gaussian Retrieval and Coherent Editing

Teng Xu, Jiamin Chen, Peng Chen, Youjia Zhang, Junqing Yu, Wei Yang

TL;DR

TIGER addresses text-driven retrieval and editing of explicit 3D Gaussian scenes by embedding language into each Gaussian primitive via a bottom-up language feature map that supports open-vocabulary queries. It then performs coherent edits using Coherent Score Distillation (CSD), which fuses a 2D image-editing diffusion model and a multi-view diffusion model to achieve multi-view-consistent, detail-rich modifications while weighting updates by Gaussian relevance. Empirical results show TIGER achieves superior open-vocabulary localization accuracy and more realistic, view-consistent edits than prior methods, demonstrating the practicality of direct 3D-space retrieval and diffusion-based editing for 3D Gaussian representations. The approach reduces dependence on repeated 2D segmentation and improves edit fidelity, enabling compositional edits with broader applicability in 3D scenes.
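To make the retrieval idea concrete, here is a minimal sketch of how per-Gaussian language features could be matched against a text query. It assumes cosine similarity as the relevance measure and a fixed threshold for selection; both are illustrative choices, and the names `relevance_scores`, `retrieve`, and `threshold` are hypothetical rather than taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def relevance_scores(gaussian_feats, text_feat):
    """Assumed relevance: cosine similarity clamped to [0, 1], one score per Gaussian."""
    return [max(0.0, cosine(f, text_feat)) for f in gaussian_feats]

def retrieve(gaussian_feats, text_feat, threshold=0.5):
    """Return indices of Gaussians whose relevance reaches the (hypothetical) threshold."""
    scores = relevance_scores(gaussian_feats, text_feat)
    return [i for i, s in enumerate(scores) if s >= threshold]

# Toy data: three Gaussians with 2-D language features and one query embedding.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
scores = relevance_scores(feats, query)
selected = retrieve(feats, query, threshold=0.5)
```

In a real system the features would come from a vision-language encoder and have hundreds of dimensions; the toy vectors here only show the selection logic.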

Abstract

Editing objects within a scene is a critical functionality required across a broad spectrum of applications in computer vision and graphics. As 3D Gaussian Splatting (3DGS) emerges as a frontier in scene representation, the effective modification of 3D Gaussian scenes has become increasingly vital. This process entails accurately retrieving the target objects and subsequently modifying them based on instructions. Existing techniques cover these steps only in pieces: they mainly embed sparse semantics into Gaussians for retrieval and rely on an iterative dataset update paradigm for editing, leading to over-smoothing or inconsistency issues. To this end, this paper proposes a systematic approach, namely TIGER, for coherent text-instructed 3D Gaussian retrieval and editing. In contrast to the top-down language grounding approach for 3D Gaussians, we adopt a bottom-up language aggregation strategy to generate denser language-embedded 3D Gaussians that support open-vocabulary retrieval. To overcome the over-smoothing and inconsistency issues in editing, we propose Coherent Score Distillation (CSD), which aggregates a 2D image editing diffusion model and a multi-view diffusion model for score distillation, producing multi-view consistent edits with much finer details. In various experiments, we demonstrate that TIGER accomplishes more consistent and realistic edits than prior work.
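The abstract does not give the CSD formula, so the following is only a toy sketch of the general idea: blend the noise predictions of two diffusion models into one score-distillation-style residual, then scale each Gaussian's update by its relevance. The linear blend, the weight `lam`, the learning rate `lr`, and the one-scalar-per-Gaussian parameterization are all assumptions made for illustration, not the paper's formulation.

```python
def combined_residual(eps_edit, eps_mv, noise, lam=0.5):
    """Blend two noise predictions (assumed linear mix) and subtract the injected noise."""
    return [lam * e + (1.0 - lam) * m - n
            for e, m, n in zip(eps_edit, eps_mv, noise)]

def weighted_update(params, grads, relevance, lr=0.1):
    """One gradient step per Gaussian, scaled by that Gaussian's relevance in [0, 1]."""
    return [p - lr * r * g for p, g, r in zip(params, grads, relevance)]

# Toy data: two Gaussians, each reduced to a single scalar parameter.
eps_edit = [0.4, 0.4]   # stand-in for the 2D editing model's noise prediction
eps_mv = [0.2, 0.2]     # stand-in for the multi-view model's noise prediction
noise = [0.1, 0.1]      # noise that was added before denoising
grads = combined_residual(eps_edit, eps_mv, noise, lam=0.5)

params = [1.0, 1.0]
relevance = [1.0, 0.0]  # first Gaussian matches the prompt, second does not
new_params = weighted_update(params, grads, relevance, lr=0.1)
```

The second Gaussian is left unchanged because its relevance is zero, which is the behavior the relevance weighting is meant to provide: edits stay confined to the retrieved region.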

Paper Structure (21 sections, 4 equations, 19 figures, 1 table)

Figures (19)

  • Figure 1: TIGER presents a systematic framework for 3D Gaussian retrieval and editing. TIGER integrates language features into each Gaussian primitive and supports open-vocabulary queries directly in 3D space. TIGER demonstrates excellent zero-shot retrieval capabilities and enables detail-preserving, multi-view consistent editing.
  • Figure 2: The pipeline of our method. We first embed language features into each Gaussian primitive. Upon receiving an editing prompt, we compute a relevance score for each Gaussian w.r.t. the given prompt. Subsequently, we update the Gaussians using our CSD, weighted by the relevance scores.
  • Figure 3: Our language embedding process. We use MaskCLIP to generate low-resolution semantic features with global context information, then upsample them to high resolution with FeatUp (Fu et al., 2024) for 3D Gaussian language supervision. To better preserve sharp boundaries, we apply SAM at the finest level and aggregate features within each fine mask. Finally, the refined language features are embedded into 3D Gaussians via differentiable rendering, enabling precise retrieval of relevant Gaussian points based on an open-vocabulary query.
  • Figure 4: Our 3D Gaussian editing method uses a Coherent Score Distillation that leverages a 2D image editing diffusion model (InstructPix2Pix) for instruction-based editing and a multi-view diffusion model (MVDream) to address the multi-face inconsistency issue, achieving multi-view consistent edits with fine details.
  • Figure 5: Qualitative comparison: TIGER performs well in fine-grained localization.
  • ...and 14 more figures
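The mask-based refinement described in the Figure 3 caption can be sketched as follows. This is a minimal illustration that assumes the aggregation is a plain mean over all pixels sharing a mask id; the caption does not specify the pooling operator, and `aggregate_by_mask` is a hypothetical helper name.

```python
def aggregate_by_mask(pixel_feats, mask_ids):
    """Replace each pixel feature with the mean feature of its mask (assumed mean pooling)."""
    sums, counts = {}, {}
    for feat, m in zip(pixel_feats, mask_ids):
        if m not in sums:
            sums[m] = [0.0] * len(feat)
            counts[m] = 0
        sums[m] = [s + x for s, x in zip(sums[m], feat)]
        counts[m] += 1
    means = {m: [s / counts[m] for s in sums[m]] for m in sums}
    return [means[m] for m in mask_ids]

# Toy data: four pixels with 2-D features, split across two masks.
feats = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
masks = [0, 0, 1, 1]
refined = aggregate_by_mask(feats, masks)
```

After aggregation every pixel inside a mask carries the same feature, so feature boundaries coincide with mask boundaries, which is one way to obtain the sharp edges the caption mentions.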