Free-Editor: Zero-shot Text-driven 3D Scene Editing

Nazmul Karim, Hasan Iqbal, Umar Khalid, Jing Hua, Chen Chen

TL;DR

The paper tackles the challenge of editing 3D scenes without retraining a model for each new edit or scene. It introduces Free-Editor, a pipeline that needs no retraining at test time: a generalized NeRF with an Edit Transformer transfers the edit from a single edited starting view to all target views. Self-view attention enforces intra-view consistency, cross-view attention handles inter-view style transfer, and an epipolar-based step aggregates the resulting features. Training uses four losses, $\mathcal{L}_{mse}$, $\mathcal{L}_{con}$, $\mathcal{L}_{self}$, and $\mathcal{L}_{en}$, together with a data-generation setup built on BLIP, GPT, and IP2P, so that edits can be applied zero-shot. Results on LLFF and related datasets show editing that is approximately 20x faster than prior methods at competitive quality, supported by qualitative and quantitative analyses, ablations, and user studies. By removing per-scene retraining while maintaining 3D consistency, the work makes broad, text-guided 3D scene editing more practical.

Abstract

Text-to-Image (T2I) diffusion models have recently gained traction for their versatility and user-friendliness in 2D content generation and editing. However, training a diffusion model specifically for 3D scene editing is challenging due to the scarcity of large-scale datasets. Currently, editing 3D scenes necessitates either retraining the model to accommodate various 3D edits or developing specific methods tailored to each unique editing type. Moreover, state-of-the-art (SOTA) techniques require multiple synchronized edited images from the same scene to enable effective scene editing. Given the current limitations of T2I models, achieving consistent editing effects across multiple images remains difficult, leading to multi-view inconsistency in editing. This inconsistency undermines the performance of 3D scene editing when these images are utilized. In this study, we introduce a novel, training-free 3D scene editing technique called Free-Editor, which enables users to edit 3D scenes without the need for model retraining during the testing phase. Our method effectively addresses the issue of multi-view style inconsistency found in SOTA methods through the implementation of a single-view editing scheme. Specifically, we demonstrate that editing a particular 3D scene can be achieved by modifying only a single view. To facilitate this, we present an Edit Transformer that ensures intra-view consistency and inter-view style transfer using self-view and cross-view attention mechanisms, respectively. By eliminating the need for model retraining and multi-view editing, our approach significantly reduces editing time and memory resource requirements, achieving runtimes approximately 20 times faster than SOTA methods. We have performed extensive experiments on various benchmark datasets, showcasing the diverse editing capabilities of our proposed technique.
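
To make the cross-view attention idea concrete, the following is a minimal sketch of scaled dot-product attention in which feature tokens from a source view act as queries and feature tokens from the edited starting view act as keys and values. It is an illustration under stated assumptions, not the paper's Edit Transformer: the function names `softmax` and `cross_view_attention` are hypothetical, and the learned query/key/value projections, multiple heads, and positional or ray information that a real transformer layer would use are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_view_attention(source_feats, edited_feats):
    """Single-head scaled dot-product attention without learned projections.

    source_feats: list of d-dim vectors (queries) from a source view.
    edited_feats: list of d-dim vectors (keys and values) from the
                  edited starting view.
    Returns one d-dim vector per source token: a convex combination of
    the edited-view features, weighted by query-key similarity.
    """
    d = len(edited_feats[0])
    scale = 1.0 / math.sqrt(d)
    output = []
    for q in source_feats:
        # Similarity between this source token and every edited-view token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in edited_feats]
        weights = softmax(scores)
        # Weighted sum of edited-view features (values equal keys here).
        output.append([sum(w * v[j] for w, v in zip(weights, edited_feats))
                       for j in range(d)])
    return output
```

In this simplified form, every output token is a mixture of edited-view features, which is one way to picture how style information from the single edited view could flow into the other views' feature maps.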

Paper Structure (14 sections, 14 equations, 7 figures, 6 tables)


Figures (7)

  • Figure 1: Multi-View Inconsistency in Current Text-to-Image (T2I) Editing Models: The current T2I editing model InstructPix2Pix [brooks2023instructpix2pix] faces significant challenges with multi-view consistency. This issue adversely affects the quality of 3D scene editing, especially when these edited views are used to synthesize novel views. This specific limitation is also acknowledged in IN2N [haque2023instruct]. Note that this inconsistency is particularly problematic when editing is performed without re-training, which is the setting we target.
  • Figure 2: 3D Scene Editing using our proposed method for different target poses.
  • Figure 3: Overview of our proposed method. Top Left. We train a generalized NeRF ($\mathbf{G}(.)$) model that takes an edited starting view and $M$ source views to render a novel target view. Here, the edited target view is not an input to the model; rather, it is rendered and serves as the ground truth for the model output. In $\mathbf{G}(.)$, we employ a novel Edit Transformer. Bottom Left. The Edit Transformer uses cross-view attention to produce style-informed source feature maps, which are then aggregated through an Epipolar transformer. Top Right. During training, we employ different sets of source views $S_{a}, S_{b}, S'_{b}$ for 4 different loss functions. Note that $S'_{a}$ is a variant of $S_{a}$ with additional ray information for calculating $\mathcal{L}_{con}$. Bottom Right. During inference, only a single image needs to be edited to obtain a 3D-edited scene.
  • Figure 4: Text-driven 3D scene editing. Illustration of text-driven 3D scene editing using our proposed method across various target poses. This figure showcases the view-consistent results generated by our method. A qualitative evaluation on multiple scenes reveals the efficacy of our approach: starting from a single view, our method successfully generates novel views that are conditioned on the editing prompt, demonstrating its robustness and versatility in 3D scene editing.
  • Figure 5: Style Transfer Comparison. Our method performs style edits within 3D NeRF scenes, handling intricate modifications and prompt-guided editing. Visually, our results resemble those of IN2N, since both methods use IP2P for 2D image editing. However, our method tends to preserve background details more effectively than IN2N.
  • ...and 2 more figures