ReplaceAnything3D: Text-Guided 3D Scene Editing with Compositional Neural Radiance Fields

Edward Bartrum, Thu Nguyen-Phuoc, Chris Xie, Zhengqin Li, Numair Khan, Armen Avetisyan, Douglas Lanman, Lei Xiao

TL;DR

ReplaceAnything3D (RAM3D) tackles the challenge of editing 3D scenes from text prompts with multi-view consistency. It introduces a two-stage Erase-and-Replace pipeline that first inpaints the background after erasing a target object and then generates a new object conditioned on a replacement prompt, all within a Bubble-NeRF framework to keep computations localized. By distilling 2D diffusion priors through HiFA-inspired losses and integrating a scene-aware LDM inpainting model, RAM3D achieves coherent, photorealistic edits that can also remove or add objects and support personalized content via Dreambooth-style fine-tuning. The approach yields improved visual fidelity and cross-view coherence across forward-facing and 360° scenes, offering a versatile tool for VR/MR, gaming, and film production.

Abstract

We introduce the ReplaceAnything3D model (RAM3D), a novel text-guided 3D scene editing method that enables the replacement of specific objects within a scene. Given multi-view images of a scene, a text prompt describing the object to replace, and a text prompt describing the new object, our Erase-and-Replace approach can effectively swap objects in the scene with newly generated content while maintaining 3D consistency across multiple viewpoints. We demonstrate the versatility of ReplaceAnything3D by applying it to various realistic 3D scenes, showcasing results of modified foreground objects that are well integrated with the rest of the scene without affecting its overall integrity.

Paper Structure (32 sections, 7 equations, 14 figures, 1 table)

Figures (14)

  • Figure 1: Our method enables prompt-driven object replacement for a variety of realistic 3D scenes.
  • Figure 2: An overview of the $\text{RAM3D}$ Erase and Replace stages.
  • Figure 3: The masked region $\mathbf{m}$ (blue) serves as a conditioning signal for the LDM, indicating the area to be inpainted. The nearby pixels surrounding $\mathbf{m}$ form the halo region $\mathbf{h}$ (green), which is also rendered volumetrically by $\text{RAM3D}$ during the Erase stage. The union of these two regions is the Bubble-NeRF region, whilst the remaining pixels are sampled from the input image (red).
  • Figure 4: Replace stage: $\text{RAM3D}$ volumetrically renders the masked pixels (shown in blue) to give $\mathbf{x}^{fg}$. The result is composited with $\mathbf{x}^{bg}$ to form the combined image $\mathbf{x}$.
  • Figure 5: Comparison with Instruct-NeRF2NeRF.
  • ...and 9 more figures
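
The region layout in Figure 3 and the compositing step in Figure 4 can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`dilate`, `bubble_regions`, `composite`), the square dilation window used to build the halo, the `halo_radius` parameter, and the hard per-pixel mask blend are all assumptions made for illustration. The paper's actual halo construction and compositing (which may rely on rendered opacity) could differ.

```python
import numpy as np

def dilate(mask: np.ndarray, radius: int) -> np.ndarray:
    """Binary dilation with a square (2*radius+1) window, NumPy only."""
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    # OR together every shifted copy of the mask inside the window.
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def bubble_regions(mask: np.ndarray, halo_radius: int):
    """Return (halo, bubble) for a 2D object mask m.

    halo   h: pixels near m but not inside it (assumed: dilation ring)
    bubble  : m union h, the region rendered volumetrically
    """
    mask = mask.astype(bool)
    halo = dilate(mask, halo_radius) & ~mask
    bubble = mask | halo
    return halo, bubble

def composite(x_fg: np.ndarray, x_bg: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Assumed hard blend: foreground inside the mask, background elsewhere."""
    m = mask.astype(x_fg.dtype)[..., None]  # (H, W, 1), broadcast over RGB
    return m * x_fg + (1.0 - m) * x_bg

# Hypothetical usage on a 5x5 image with a single masked pixel.
mask = np.zeros((5, 5), dtype=bool)
mask[2, 2] = True
halo, bubble = bubble_regions(mask, halo_radius=1)
x = composite(np.ones((5, 5, 3)), np.zeros((5, 5, 3)), mask)
```

Pixels outside `bubble` correspond to the red region in Figure 3, which is taken directly from the input image rather than rendered.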