Towards a Training Free Approach for 3D Scene Editing

Vivek Madhavaram, Shivangana Rawat, Chaitanya Devaguptapu, Charu Sharma, Manohar Kaul

TL;DR

The paper tackles the bottleneck of training-based 3D scene editing (e.g., NeRF retraining) when performing text-guided edits in large scenes. It introduces FreeEdit, a training-free framework that uses mesh representations and foundation models to insert, replace, and delete objects in room-sized environments from text prompts. The method combines an LLM for task classification and entity extraction, text-conditioned 3D mesh synthesis (via Shap-E), object grounding with OpenMask3D, a scaling step, and an on-the-fly location-finding algorithm that minimizes intersection between the inserted object and the grounding object, with no explicit supervision for placement. Quantitative and qualitative evaluations on diverse scenes show competitive performance and real-time interactivity relative to training-based baselines, highlighting the approach's potential for open-vocabulary, interactive 3D editing.

Abstract

Text-driven diffusion models have shown remarkable capabilities in editing images. However, when editing 3D scenes, existing works mostly rely on training a NeRF for 3D editing. Recent NeRF editing methods perform edit operations by deploying 2D diffusion models and projecting these edits into 3D space. They require strong positional priors alongside the text prompt to identify the edit location. These methods operate on small 3D scenes and are largely specific to a particular scene. They require training for each specific edit and cannot be used for real-time edits. To address these limitations, we propose a novel method, FreeEdit, to make edits in a training-free manner using mesh representations as a substitute for NeRF. Training-free methods are now possible because of advances in foundation models. We leverage these models to provide a training-free alternative and introduce solutions for insertion, replacement, and deletion. We treat insertion, replacement, and deletion as basic building blocks that can be combined to perform intricate edits. Given a text prompt and a 3D scene, our model is capable of identifying which object should be inserted, replaced, or deleted and the location where the edit should be performed. We also introduce a novel algorithm as part of FreeEdit to find the optimal location on the grounding object for placement. We evaluate our model by comparing it with baseline models on a wide range of scenes using quantitative and qualitative metrics and showcase the merits of our method with respect to others.

Paper Structure

This paper contains 4 sections, 2 figures.

Figures (2)

  • Figure 1: Illustration of FreeEdit for inserting and replacing objects in a complex 3D scene: Queries in blue and orange illustrate the insertion and replacement prompts provided as input by the user, respectively.
  • Figure 2: FreeEdit: Object insertion in a 3D scene. Given a text prompt, an LLM classifies the task and extracts the primary and grounding entities. Shap-E synthesizes the primary object. OpenMask3D performs object grounding. The primary object is then scaled. The location finder computes an optimal location to place the primary object on the grounding object. Scaling and the location finder (in blue) are not pre-trained models and run on the fly.
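
The two on-the-fly stages in Figure 2 (scaling and location finding) can be illustrated with a small sketch. The summary above does not specify the paper's actual algorithm, so everything here is an assumption made for illustration: objects are reduced to axis-aligned 2D footprints on the top surface of the grounding object, `scale_to_fit` uses a hypothetical `max_fraction` heuristic, and `find_location` is a plain grid search that picks the candidate with the least overlap against footprints already on the surface. All names (`Rect`, `scale_to_fit`, `find_location`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rect:
    """Axis-aligned footprint on the grounding surface (hypothetical helper)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def width(self) -> float:
        return self.x_max - self.x_min

    @property
    def depth(self) -> float:
        return self.y_max - self.y_min


def overlap_area(a: Rect, b: Rect) -> float:
    """Area of the intersection of two footprints (0 if they are disjoint)."""
    w = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    h = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    return max(w, 0.0) * max(h, 0.0)


def scale_to_fit(obj_size: Tuple[float, float], surface: Rect,
                 max_fraction: float = 0.5) -> float:
    """Uniform scale so the object spans at most max_fraction of each
    surface side. Assumed heuristic; never enlarges the object."""
    sx = max_fraction * surface.width / obj_size[0]
    sy = max_fraction * surface.depth / obj_size[1]
    return min(sx, sy, 1.0)


def find_location(surface: Rect, obj_size: Tuple[float, float],
                  occupied: List[Rect], steps: int = 20):
    """Grid search over positions that keep the object on the surface.
    Returns ((x, y), cost), where cost is the total overlap area with
    the occupied footprints and (x, y) is the object's min corner."""
    w, d = obj_size
    if w > surface.width or d > surface.depth:
        raise ValueError("object footprint does not fit on the surface")
    best_pos, best_cost = None, float("inf")
    for i in range(steps + 1):
        for j in range(steps + 1):
            x = surface.x_min + (surface.width - w) * i / steps
            y = surface.y_min + (surface.depth - d) * j / steps
            cand = Rect(x, y, x + w, y + d)
            cost = sum(overlap_area(cand, o) for o in occupied)
            if cost < best_cost:  # keep the first best candidate
                best_pos, best_cost = (x, y), cost
    return best_pos, best_cost


# Usage: a 2 x 1 table top whose left half is already occupied.
table = Rect(0.0, 0.0, 2.0, 1.0)
clutter = [Rect(0.0, 0.0, 1.0, 1.0)]
scale = scale_to_fit((4.0, 2.0), table)      # raw synthesized footprint 4 x 2
size = (4.0 * scale, 2.0 * scale)            # footprint after scaling
position, cost = find_location(table, size, clutter)
print(scale, position, cost)
```

A grid search is used here only because it is easy to verify; a real system working on meshes would need a 3D intersection test and a way to handle non-rectangular or non-horizontal support surfaces, which this sketch does not attempt.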