Depth-supervised NeRF: Fewer Views and Faster Training for Free

Kangle Deng, Andrew Liu, Jun-Yan Zhu, Deva Ramanan

TL;DR

The paper tackles NeRF’s tendency to overfit and train slowly when few input views are available by introducing DS-NeRF, which adds a depth supervision term derived from COLMAP’s sparse 3D keypoints. It formalizes depth supervision as aligning NeRF’s ray termination distribution with depth evidence through a KL-divergence loss that incorporates depth uncertainty. The loss is complementary to existing NeRF methods and works with other depth sources, including RGB-D data, yielding 2–3× faster training and improved geometry in sparse-view scenarios. Empirical results on DTU, NeRF Real, and Redwood show better depth accuracy and view synthesis quality, especially in low-view settings.

Abstract

A commonly observed failure mode of Neural Radiance Field (NeRF) is fitting incorrect geometries when given an insufficient number of input views. One potential reason is that standard volumetric rendering does not enforce the constraint that most of a scene's geometry consists of empty space and opaque surfaces. We formalize the above assumption through DS-NeRF (Depth-supervised Neural Radiance Fields), a loss for learning radiance fields that takes advantage of readily-available depth supervision. We leverage the fact that current NeRF pipelines require images with known camera poses that are typically estimated by running structure-from-motion (SFM). Crucially, SFM also produces sparse 3D points that can be used as "free" depth supervision during training: we add a loss to encourage the distribution of a ray's terminating depth to match a given 3D keypoint, incorporating depth uncertainty. DS-NeRF can render better images given fewer training views while training 2-3x faster. Further, we show that our loss is compatible with other recently proposed NeRF methods, demonstrating that depth is a cheap and easily digestible supervisory signal. Finally, we find that DS-NeRF can support other types of depth supervision such as scanned depth sensors and RGB-D reconstruction outputs.
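
As a rough illustration of the depth loss described above, the sketch below penalizes a ray whose termination weights place little mass near a keypoint's depth. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `depth_loss`, the discretization into per-sample weights `h`, and the small constant `eps` are all hypothetical choices made for this example. The Gaussian weighting around the keypoint depth stands in for the depth uncertainty the abstract mentions.

```python
import math

def depth_loss(h, t, deltas, depth, sigma, eps=1e-10):
    """Cross-entropy-style penalty between a Gaussian centered on the
    keypoint depth and a ray's discrete termination weights.

    h      : termination weight of each sample along the ray (assumed given)
    t      : distance of each sample from the ray origin
    deltas : spacing between consecutive samples
    depth  : depth of the 3D keypoint along this ray
    sigma  : standard deviation modelling the keypoint's depth uncertainty
    eps    : hypothetical stabilizer that avoids log(0)
    """
    total = 0.0
    for h_k, t_k, dt_k in zip(h, t, deltas):
        # Unnormalized Gaussian weight: close to 1 near the keypoint depth.
        gauss = math.exp(-((t_k - depth) ** 2) / (2.0 * sigma ** 2))
        # Low termination probability near the keypoint is penalized heavily.
        total += -math.log(h_k + eps) * gauss * dt_k
    return total

# Usage: four samples along one ray, keypoint observed at depth 2.0.
t = [1.0, 2.0, 3.0, 4.0]
deltas = [1.0, 1.0, 1.0, 1.0]
peaked_at_keypoint = [0.05, 0.85, 0.05, 0.05]
peaked_elsewhere = [0.05, 0.05, 0.05, 0.85]

loss_good = depth_loss(peaked_at_keypoint, t, deltas, depth=2.0, sigma=0.5)
loss_bad = depth_loss(peaked_elsewhere, t, deltas, depth=2.0, sigma=0.5)
print(loss_good < loss_bad)
```

A larger `sigma` spreads the Gaussian weight over more samples, so a less reliable keypoint constrains the ray more loosely.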

Paper Structure

This paper contains 23 sections, 10 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Training NeRFs can be difficult when given insufficient input images. We utilize additional supervision from depth recovered from 3D point clouds estimated by running structure-from-motion and impose a loss to ensure the rendered ray's termination distribution respects the surface priors given by each keypoint. Because our supervision is complementary to NeRF, it can be combined with any such approach to reduce overfitting and speed up training.
  • Figure 2: Few view NeRF. NeRF is susceptible to overfitting when given few training views. As seen by the PSNR gap between train and test renders (left), NeRF has overfit and fails at synthesizing novel views. Further, the depth map (right) and depth error (middle) for NeRF suggest that its density function has failed to extract the surface geometry and can only reconstruct the training views' colors. Our depth-supervised NeRF model is able to render plausible geometry with consistently lower depth errors.
  • Figure 3: Ray Termination Distribution. (a) We plot various NeRF components over the distance traveled by the ray. Even if a ray traverses through multiple objects (as indicated by the multiple peaks of density $\sigma(t)$), we find that the ray termination distribution $h(t)$ is still unimodal. We find that NeRF models trained with sufficient supervision tend to have peaky, unimodal ray termination distributions as seen by the decreasing variance with more views in (c). We posit that the ideal ray termination distribution approaches a $\delta$ impulse function.
  • Figure 4: View Synthesis on DTU and Redwood. PixelNeRF, which is pre-trained on DTU, performs the best when given 3 views, although we find DS-NeRF to be visually competitive when more views are available. On Redwood, DS-NeRF is the only method to perform well in the 2-view setting.
  • Figure 5: Qualitative Comparison on NeRF Real. We render novel views and depth for various NeRF models trained on 2, 5, and 10 views. We find that methods trained with DTU struggle on NeRF Real while methods that use depth-supervision are able to render test views with realistic depth maps, even when only 2 views are provided. Please refer to nerf_real for quantitative comparisons.
  • ...and 4 more figures
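
The claim in the Figure 3 caption, that the ray termination distribution $h(t)$ stays unimodal even when the density $\sigma(t)$ has several peaks, can be checked with a small numerical sketch. It assumes the standard NeRF quadrature, where each sample's termination weight is the accumulated transmittance times the sample's opacity; the function name `termination_weights` and the toy density values are hypothetical and chosen only for illustration.

```python
import math

def termination_weights(sigmas, deltas):
    """Discrete ray termination distribution under standard NeRF quadrature.

    sigmas : volume density at each sample along the ray
    deltas : spacing between consecutive samples
    Returns one weight per sample: transmittance so far times local opacity.
    """
    weights = []
    transmittance = 1.0  # probability the ray has not terminated yet
    for sigma_k, delta_k in zip(sigmas, deltas):
        alpha_k = 1.0 - math.exp(-sigma_k * delta_k)  # opacity of this segment
        weights.append(transmittance * alpha_k)
        transmittance *= 1.0 - alpha_k  # light surviving past this segment
    return weights

# Toy ray passing through two equally dense objects (two density peaks).
sigmas = [0.0, 0.0, 50.0, 0.0, 50.0, 0.0]
deltas = [0.1] * 6
weights = termination_weights(sigmas, deltas)

for sigma_k, w_k in zip(sigmas, weights):
    print(f"sigma={sigma_k:5.1f}  h={w_k:.5f}")
```

In this toy ray nearly all of the termination mass falls on the first dense sample, because almost no transmittance survives to reach the second object. The density has two peaks while the termination weights have effectively one.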