Depth-guided NeRF Training via Earth Mover's Distance

Anita Rau, Josiah Aklilu, F. Christopher Holsinger, Serena Yeung-Levy

TL;DR

This paper tackles geometry learning in Neural Radiance Fields (NeRF) under sparse views where depth priors can be noisy and misleading. It introduces a depth-guided NeRF framework that uses off-the-shelf diffusion-based depth priors with uncertainty maps to steer ray termination distributions via Earth Mover's Distance (EMD) rather than enforcing exact depth via $L_2$ loss, and weights RGB vs depth guidance by uncertainty with a focal-loss-inspired scheme. Empirically, the method achieves strong depth metric improvements on ScanNet, outperforming baselines like DäRF and SCADE while preserving photometric quality, and demonstrates robustness to out-of-domain data. The work provides a practical drop-in approach that improves NeRF geometry in indoor scenes by leveraging diffusion-based priors and uncertainty-aware EMD supervision, with clear avenues for extending uncertainty modeling.

Abstract

Neural Radiance Fields (NeRFs) are trained to minimize the rendering loss of predicted viewpoints. However, the photometric loss often does not provide enough information to disambiguate between different possible geometries yielding the same image. Previous work has thus incorporated depth supervision during NeRF training, leveraging dense predictions from pre-trained depth networks as pseudo-ground truth. While these depth priors are assumed to be perfect once filtered for noise, in practice, their accuracy is more challenging to capture. This work proposes a novel approach to uncertainty in depth priors for NeRF supervision. Instead of using custom-trained depth or uncertainty priors, we use off-the-shelf pretrained diffusion models to predict depth and capture uncertainty during the denoising process. Because we know that depth priors are prone to errors, we propose to supervise the ray termination distance distribution with Earth Mover's Distance instead of enforcing the rendered depth to replicate the depth prior exactly through L2-loss. Our depth-guided NeRF outperforms all baselines on standard depth metrics by a large margin while maintaining performance on photometric measures.
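
To make the contrast between the two supervision signals concrete, here is a minimal sketch, not the paper's exact loss. It assumes a ray whose termination distribution is discrete over a few candidate distances and a depth prior treated as a single point; in that one-dimensional case the Earth Mover's Distance reduces to the expected absolute distance to the prior. The function names `emd_to_point_prior` and `l2_rendered_depth` and the toy numbers are illustrative assumptions.

```python
import numpy as np

def emd_to_point_prior(t, w, d_prior):
    """1D Earth Mover's Distance between the discrete distribution (t, w)
    and a point mass at d_prior: the expected absolute distance."""
    p = w / w.sum()  # normalize weights into probabilities
    return float(np.sum(p * np.abs(t - d_prior)))

def l2_rendered_depth(t, w, d_prior):
    """Squared error between the rendered (expected) depth and the prior."""
    p = w / w.sum()
    return float((np.sum(p * t) - d_prior) ** 2)

# Candidate termination distances along one ray (hypothetical values).
t = np.array([1.0, 2.0, 3.0])
d_prior = 2.0

# Two distributions with the same expected depth of 2.0.
peaked = np.array([0.0, 1.0, 0.0])   # all mass at the prior depth
bimodal = np.array([0.5, 0.0, 0.5])  # mass split in front of and behind it

l2_peaked = l2_rendered_depth(t, peaked, d_prior)
l2_bimodal = l2_rendered_depth(t, bimodal, d_prior)
emd_peaked = emd_to_point_prior(t, peaked, d_prior)
emd_bimodal = emd_to_point_prior(t, bimodal, d_prior)

print("L2 :", l2_peaked, l2_bimodal)
print("EMD:", emd_peaked, emd_bimodal)
```

In this toy case the loss on the rendered depth cannot tell the two rays apart, because both have the same mean, whereas the distribution-level distance penalizes the bimodal ray whose mass lies away from the prior.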

Paper Structure (26 sections, 10 equations, 15 figures, 7 tables)

Figures (15)

  • Figure 1: Left: Predicted monocular depth priors are not perfect and false interpretations of a scene's geometry are unavoidable. Blindly forcing a NeRF to replicate such priors (e.g. through $L_2$-loss) leads to high geometric losses. Right: Overview of our method. We use depth priors to guide NeRF training via Earth Mover's Distance (EMD).
  • Figure 2: A detailed schematic of our depth-guided NeRF optimization. (i) A pretrained diffusion model for depth prediction, DiffDP, provides depth priors. Measuring the progression of depth predictions throughout the denoising process provides uncertainty maps. (ii) Given input poses ($x$,$y$,$z$,$\theta$,$\phi$), a network $F$ outputs an RGB value and a density. From the outputs, we derive weights $w$ that, when normalized, serve as a piece-wise-constant probability density function. We can then construct a cumulative distribution function (CDF) from which we sample new ray termination distances. We supervise these samples with the Earth Mover's Distance (EMD) to the depth prior. (iii) Finally, we weight the photometric and depth losses according to the DiffDP-derived uncertainty.
  • Figure 3: Example of DiffDP depth prediction that misinterprets the depicted geometry. Although the predicted depth has errors, the uncertainty map is able to highlight areas of large errors. This allows us to tune down the depth loss in unreliable areas.
  • Figure 4: Good RGB rendering quality does not imply good geometric understanding. In this test example, SCADE accurately renders the image (PSNR and SSIM above average), while misinterpreting the geometry of the depicted scene (RMSE five times the average). The model does not capture the cabinet below the desk in the depth map.
  • Figure 5: Qualitative results of rendered RGB images and depth maps. Our method produces fewer artifacts in the left example and learns a better geometric representation of the table in the right example.
  • ...and 10 more figures
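
Step (ii) of the Figure 2 caption (weights, piece-wise-constant PDF, CDF, sampled termination distances) can be sketched as follows. This is a hedged illustration that assumes the standard NeRF volume-rendering weights and inverse-transform sampling; the helper names, bin layout, and densities are hypothetical and not taken from the paper's code.

```python
import numpy as np

def render_weights(sigma, deltas):
    """Standard NeRF weights: w_i = T_i * (1 - exp(-sigma_i * delta_i)),
    where T_i is the transmittance accumulated before bin i."""
    alpha = 1.0 - np.exp(-sigma * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    return trans * alpha

def sample_termination_distances(bin_edges, weights, n_samples, rng):
    """Inverse-transform sampling from the piece-wise-constant PDF
    defined by the normalized weights over the given bins."""
    pdf = weights / weights.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    u = rng.uniform(0.0, 1.0, size=n_samples)
    # Find the bin whose CDF interval contains each u.
    idx = np.clip(np.searchsorted(cdf, u, side="right") - 1, 0, len(weights) - 1)
    width = cdf[idx + 1] - cdf[idx]
    safe_width = np.where(width > 0, width, 1.0)  # avoid division by zero
    frac = np.where(width > 0, (u - cdf[idx]) / safe_width, 0.0)
    # Interpolate linearly inside the selected bin.
    return bin_edges[idx] + frac * (bin_edges[idx + 1] - bin_edges[idx])

# Hypothetical ray: four bins between distance 0 and 4, opaque matter in bin [2, 3].
bin_edges = np.linspace(0.0, 4.0, 5)
sigma = np.array([0.0, 0.0, 50.0, 0.0])
weights = render_weights(sigma, np.diff(bin_edges))

rng = np.random.default_rng(0)
samples = sample_termination_distances(bin_edges, weights, 64, rng)

# Sample-based distance to a (hypothetical) depth prior for this ray.
d_prior = 2.5
sample_emd = float(np.mean(np.abs(samples - d_prior)))
print(weights, samples.min(), samples.max(), sample_emd)
```

Because the only dense bin spans distances 2 to 3, every sampled termination distance falls in that interval, and the mean absolute deviation from the prior stays small when the prior agrees with the learned geometry.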