ReplaceAnything3D: Text-Guided 3D Scene Editing with Compositional Neural Radiance Fields

Edward Bartrum; Thu Nguyen-Phuoc; Chris Xie; Zhengqin Li; Numair Khan; Armen Avetisyan; Douglas Lanman; Lei Xiao

TL;DR

ReplaceAnything3D (RAM3D) edits 3D scenes from text prompts while keeping the result consistent across views. It introduces a two-stage Erase-and-Replace pipeline: the Erase stage removes the target object and inpaints the background behind it, and the Replace stage generates a new object conditioned on a replacement prompt. Both stages run within a Bubble-NeRF framework that keeps computation local to the edited region. By distilling 2D diffusion priors through HiFA-inspired losses and integrating a scene-aware LDM inpainting model, RAM3D produces coherent, photorealistic edits; it can also remove or add objects and supports personalized content via DreamBooth-style fine-tuning. The approach is reported to improve visual fidelity and cross-view coherence on forward-facing and 360° scenes, with potential applications in VR/MR, gaming, and film production.

Abstract

We introduce the ReplaceAnything3D model (RAM3D), a novel text-guided 3D scene editing method that enables the replacement of specific objects within a scene. Given multi-view images of a scene, a text prompt describing the object to replace, and a text prompt describing the new object, our Erase-and-Replace approach can effectively swap objects in the scene with newly generated content while maintaining 3D consistency across multiple viewpoints. We demonstrate the versatility of ReplaceAnything3D by applying it to various realistic 3D scenes, showcasing results of modified foreground objects that are well-integrated with the rest of the scene without affecting its overall integrity.

Paper Structure (32 sections, 7 equations, 14 figures, 1 table)

Introduction
Related work
Diffusion model for text-guided image editing
Neural radiance fields editing
Text-to-3D synthesis
Preliminary
NeRF
Distilling text-to-image diffusion models
Method
Overview
Erase stage
Replace stage
Training the final NeRF
Results
Training details
...and 17 more sections

Figures (14)

Figure 1: Our method enables prompt-driven object replacement for a variety of realistic 3D scenes.
Figure 2: An overview of the $\text{RAM3D}$ Erase and Replace stages.
Figure 3: The masked region (blue) serves as a conditioning signal for the LDM, indicating the area to be inpainted. The nearby pixels surrounding $\mathbf{m}$ form the halo region $\mathbf{h}$ (green), which is also rendered volumetrically by $\text{RAM3D}$ during the Erase stage. The union of these 2 regions is the Bubble-NeRF region, whilst the remaining pixels are sampled from the input image (red).
Figure 4: Replace stage: $\text{RAM3D}$ volumetrically renders the masked pixels (shown in blue) to give $\mathbf{x}^{fg}$. The result is composited with $\mathbf{x}^{bg}$ to form the combined image $\mathbf{x}$.
Figure 5: Comparison with Instruct-NeRF2NeRF.
...and 9 more figures
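
The region split described in the Figure 3 caption and the compositing step in the Figure 4 caption can be illustrated with a small sketch. It is not the authors' implementation: the function names (`dilate`, `split_regions`, `composite`), the square dilation used to grow the halo, the halo radius, and the hard per-pixel selection used for compositing are all assumptions made for illustration, and plain Python lists stand in for image tensors.

```python
def dilate(mask, radius):
    """Grow a boolean mask by `radius` pixels using a square neighbourhood (assumed)."""
    height, width = len(mask), len(mask[0])
    out = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    out[ny][nx] = True
    return out


def split_regions(mask, halo_radius=1):
    """Return (halo h, bubble region), where bubble = m union h as in the Figure 3 caption."""
    bubble = dilate(mask, halo_radius)
    halo = [[b and not m for b, m in zip(brow, mrow)]
            for brow, mrow in zip(bubble, mask)]
    return halo, bubble


def composite(rendered, source, region):
    """Use rendered pixels inside `region` and source pixels elsewhere (hard selection, assumed)."""
    return [[r if inside else s for r, s, inside in zip(rrow, srow, grow)]
            for rrow, srow, grow in zip(rendered, source, region)]


# Toy 5x5 example: a single masked pixel in the centre.
mask = [[False] * 5 for _ in range(5)]
mask[2][2] = True
halo, bubble = split_regions(mask, halo_radius=1)

# Placeholder "images": strings instead of colours, so the result is easy to read.
x_fg = [["fg"] * 5 for _ in range(5)]   # stands in for the rendered foreground
x_bg = [["bg"] * 5 for _ in range(5)]   # stands in for the background image
x = composite(x_fg, x_bg, mask)
```

In the Erase stage the same `composite` helper would be called with `bubble` as the region, so that both the mask and the halo come from the rendered output while the remaining pixels come from the input image.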
