TIGER: Text-Instructed 3D Gaussian Retrieval and Coherent Editing
Teng Xu, Jiamin Chen, Peng Chen, Youjia Zhang, Junqing Yu, Wei Yang
TL;DR
TIGER addresses text-driven retrieval and editing of explicit 3D Gaussian scenes by embedding language into each Gaussian primitive via a bottom-up language feature map that supports open-vocabulary queries. It then performs coherent edits using Coherent Score Distillation (CSD), which fuses a 2D image-editing diffusion model and a multi-view diffusion model to achieve multi-view-consistent, detail-rich modifications while weighting updates by Gaussian relevance. Empirical results show TIGER achieves superior open-vocabulary localization accuracy and more realistic, view-consistent edits than prior methods, demonstrating the practicality of direct 3D-space retrieval and diffusion-based editing for 3D Gaussian representations. The approach reduces dependence on repeated 2D segmentation and improves edit fidelity, enabling compositional edits with broader applicability in 3D scenes.
Abstract
Editing objects within a scene is a critical capability across a broad spectrum of applications in computer vision and graphics. As 3D Gaussian Splatting (3DGS) emerges as a frontier in scene representation, effective modification of 3D Gaussian scenes has become increasingly vital. This process entails accurately retrieving the target objects and then modifying them according to instructions. Existing techniques address these steps only in pieces: they mainly embed sparse semantics into Gaussians for retrieval and rely on an iterative dataset update paradigm for editing, which leads to over-smoothing or inconsistency. To this end, this paper proposes a systematic approach, namely TIGER, for coherent text-instructed 3D Gaussian retrieval and editing. In contrast to the top-down language grounding approach for 3D Gaussians, we adopt a bottom-up language aggregation strategy to generate denser language-embedded 3D Gaussians that support open-vocabulary retrieval. To overcome the over-smoothing and inconsistency issues in editing, we propose Coherent Score Distillation (CSD), which aggregates a 2D image editing diffusion model and a multi-view diffusion model for score distillation, producing multi-view consistent edits with much finer details. In various experiments, we demonstrate that TIGER accomplishes more consistent and realistic edits than prior work.
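
To make the two ideas above concrete, the following is a minimal sketch, not the paper's implementation. It assumes that retrieval can be approximated by cosine similarity between a per-Gaussian language feature and a text query embedding, and that the two diffusion models' score-distillation gradients are combined by a simple linear blend before being scaled by per-Gaussian relevance. The abstract does not specify either formula; the function names, the blend weight `lam`, and the toy arrays are all hypothetical.

```python
import numpy as np

def relevance(gaussian_feats, text_emb):
    """Per-Gaussian relevance to a text query (assumed: cosine similarity, clipped to [0, 1])."""
    g = gaussian_feats / np.linalg.norm(gaussian_feats, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    return np.clip(g @ t, 0.0, 1.0)

def combined_update(grad_edit, grad_mv, weights, lam=0.5):
    """Blend two per-Gaussian gradients (assumed: linear blend) and scale each row by relevance."""
    blended = lam * grad_edit + (1.0 - lam) * grad_mv
    return weights[:, None] * blended

# Toy data: 3 Gaussians with 2-D language features and 3-D parameter gradients.
feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
text = np.array([1.0, 0.0])          # hypothetical query embedding
grad_edit = np.ones((3, 3))          # stand-in for the 2D editing model's gradient
grad_mv = np.zeros((3, 3))           # stand-in for the multi-view model's gradient

rel = relevance(feats, text)
update = combined_update(grad_edit, grad_mv, rel, lam=0.5)
```

In this toy setup, the Gaussian whose feature is orthogonal to the query receives a zero update, so the edit stays confined to the Gaussians that the query retrieves.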
