Depth Priors in Removal Neural Radiance Fields

Zhihao Guo, Peng Wang

TL;DR

The paper tackles object removal in NeRF-reconstructed scenes, an edit that must stay consistent across views. It proposes a pipeline built on SpinNeRF and augmented with monocular depth priors from ZoeDepth, and it validates COLMAP dense depth as a practical ground-truth proxy on KITTI. Experiments show that ZoeDepth priors yield higher PSNR than DSNeRF-based depth priors and cut depth-prior acquisition time from 44.5 s to 0.58 s per frame. The results suggest faster, higher-fidelity object removal, with potential for more efficient digital twin creation.
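The two quantities quoted in the TL;DR, PSNR and per-frame acquisition time, can be made concrete with a small sketch. The code below uses the standard PSNR definition over pixel values normalized to [0, 1]; the function names `psnr` and `speedup` are hypothetical helpers for illustration, not part of SpinNeRF, DSNeRF, or ZoeDepth, and the paper's exact evaluation code may differ.

```python
import math

def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences.

    Assumes pixel values lie in [0, max_value]; higher PSNR means the rendered
    view is closer to the reference view.
    """
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - p) ** 2 for r, p in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def speedup(baseline_seconds, new_seconds):
    """Ratio of baseline time to new time (values above 1 mean faster)."""
    return baseline_seconds / new_seconds

# Per-frame depth-prior acquisition times quoted in the TL;DR.
ratio = speedup(44.5, 0.58)
print(f"ZoeDepth priors are roughly {ratio:.0f}x faster to acquire per frame")
```

With the quoted timings, the ratio works out to roughly 77, which is where most of the efficiency gain claimed for the pipeline comes from.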

Abstract

Neural Radiance Fields (NeRF) have achieved impressive results in 3D reconstruction and novel view generation. A significant challenge within NeRF involves editing reconstructed 3D scenes, such as object removal, which demands consistency across multiple views and the synthesis of high-quality perspectives. Previous studies have integrated depth priors, typically sourced from LiDAR or sparse depth estimates from COLMAP, to enhance NeRF's performance in object removal. However, these methods are either expensive or time-consuming. This paper proposes a new pipeline that leverages SpinNeRF and monocular depth estimation models like ZoeDepth to enhance NeRF's performance in complex object removal with improved efficiency. A thorough evaluation of COLMAP's dense depth reconstruction on the KITTI dataset is conducted to demonstrate that COLMAP can be viewed as a cost-effective and scalable alternative for acquiring depth ground truth compared to traditional methods like LiDAR. This serves as the basis for evaluating the performance of monocular depth estimation models to determine the best one for generating depth priors for SpinNeRF. The new pipeline is tested in various scenarios involving 3D reconstruction and object removal, and the results indicate that our pipeline significantly reduces the time required for the acquisition of depth priors for object removal and enhances the fidelity of the synthesized views, suggesting substantial potential for building high-fidelity digital twin systems with increased efficiency in the future.
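As a rough illustration of how a monocular depth estimate might be scored against a reference depth map (LiDAR or, as the abstract proposes, COLMAP dense depth), the sketch below computes three commonly used depth metrics: absolute relative error, RMSE, and the fraction of pixels with ratio below 1.25. Whether the paper reports exactly these metrics is an assumption here, and `depth_errors` is a hypothetical helper. Reference pixels with no depth (common in both LiDAR and COLMAP output) are marked `None` and skipped.

```python
import math

def depth_errors(reference, predicted, min_depth=1e-3):
    """Compare predicted depths with reference depths, pixel by pixel.

    `reference` may contain None where no reference depth is available;
    those pixels, and any depth at or below `min_depth`, are ignored.
    """
    pairs = [(g, p) for g, p in zip(reference, predicted)
             if g is not None and g > min_depth and p > min_depth]
    if not pairs:
        raise ValueError("no valid pixels to compare")
    n = len(pairs)
    abs_rel = sum(abs(g - p) / g for g, p in pairs) / n
    rmse = math.sqrt(sum((g - p) ** 2 for g, p in pairs) / n)
    # Fraction of pixels whose prediction is within a factor of 1.25 of the reference.
    delta1 = sum(1 for g, p in pairs if max(g / p, p / g) < 1.25) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}

# Toy example: four pixels, one without reference depth.
reference = [2.0, 4.0, None, 10.0]
predicted = [2.0, 6.0, 3.0, 10.0]
print(depth_errors(reference, predicted))
```

Running the same function over several candidate models against one shared reference is one simple way to rank them before choosing which to use for depth priors.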

Paper Structure

This paper contains 20 sections, 13 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of the proposed object removal pipeline. Starting with sparse views and corresponding masks as inputs, multi-view segmentation (indicated by the red arrow) is employed to generate consistent masks. Depth maps and inpainted depth maps as priors are produced by ZoeDepth (depicted by the blue arrow). The green arrow highlights the multi-view consistent inpainting process, which integrates inpainted depth maps and RGB images into the updated NeRF model to render novel views.
  • Figure 2: Depth Map Estimation Comparison. From top to bottom: the raw image, ground truth depth map, COLMAP dense depth map, EcoDepth, Depth Anything, and ZoeDepth.
  • Figure 3: Estimated Depth Map Comparison. From left to right columns: input images; depth maps from COLMAP; depth maps from EcoDepth; depth maps from Depth Anything; depth maps from ZoeDepth.
  • Figure 4: Depth Map Comparison on Input Image and Inpainted Image. Top Row, from left to right: the input image; the depth map obtained by DSNeRF; the inpainted image; the inpainted depth map. Bottom Row, from left to right: the input image; the depth map obtained by ZoeDepth; the inpainted image; the inpainted depth map.
  • Figure 5: Rendered views. Top Row: depth priors from DSNeRF; Bottom Row: depth priors from ZoeDepth.
  • ...and 1 more figure