DARF: Depth-Aware Generalizable Neural Radiance Field

Yue Shi, Dingyi Rong, Chang Chen, Chaofan Ma, Bingbing Ni, Wenjun Zhang

TL;DR

DARF makes generalizable NeRF depth-aware by injecting a depth prior through Depth-Aware Dynamic Sampling (DADS), which concentrates ray samples near the estimated surface within a depth interval $D=[d_c-\Delta d, d_c+\Delta d]$. A geometry-aware encoder-decoder ($f_{GARF}$) fuses multi-view features and predicts color and density with a transformer-based decoder, while a Multi-level Semantic Consistency (MSC) loss encourages more informative representations. On indoor and outdoor datasets, DARF uses about 50% fewer sample points than state-of-the-art generalizable NeRF methods while improving rendering quality and depth estimation, without per-scene optimization, and it remains robust under view-sparse conditions. The result is efficient, generalizable neural rendering with unsupervised depth estimation; code is provided for replication.
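
As a rough illustration of the interval $D$ above, the sketch below contrasts stratified sampling over a full near-far range with stratified sampling restricted to $[d_c-\Delta d, d_c+\Delta d]$. It assumes that $d_c$ is the depth estimated for the ray by the pre-trained depth module and that $\Delta d$ is a fixed half-width; the function names, the clipping to the near-far range, and the sample counts (64 versus 32) are hypothetical choices for this example. It does not reproduce the dynamic part of DADS.

```python
import random


def stratified_samples(lo, hi, n):
    """Draw one random depth from each of n equal bins covering [lo, hi)."""
    step = (hi - lo) / n
    return [lo + (i + random.random()) * step for i in range(n)]


def depth_aware_samples(d_c, delta_d, n, near, far):
    """Sample only inside D = [d_c - delta_d, d_c + delta_d].

    d_c and delta_d are assumed inputs (estimated depth and half-width);
    the interval is clipped so samples never leave [near, far].
    """
    lo = max(near, d_c - delta_d)
    hi = min(far, d_c + delta_d)
    return stratified_samples(lo, hi, n)


random.seed(0)
near, far = 2.0, 6.0

# Baseline: 64 samples spread over the whole ray segment.
coarse = stratified_samples(near, far, 64)

# Depth-aware: half as many samples, all within 0.25 of the depth prior.
focused = depth_aware_samples(d_c=3.5, delta_d=0.25, n=32, near=near, far=far)

print(min(coarse), max(coarse))
print(min(focused), max(focused))
```

With these illustrative numbers, the focused samples are spaced far more densely around the assumed surface depth than the baseline samples, even though there are half as many of them.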

Abstract

Neural Radiance Field (NeRF) has revolutionized novel-view rendering and achieved impressive results. However, inefficient sampling and per-scene optimization hinder its wide application. Although some generalizable NeRFs have been proposed, their rendering quality is unsatisfactory because they lack geometry and scene uniqueness. To address these issues, we propose the Depth-Aware Generalizable Neural Radiance Field (DARF) with a Depth-Aware Dynamic Sampling (DADS) strategy to perform efficient novel-view rendering and unsupervised depth estimation on unseen scenes without per-scene optimization. Unlike most existing generalizable NeRFs, our framework infers unseen scenes at both the pixel level and the geometry level from only a few input images. By introducing a pre-trained depth estimation module to derive the depth prior, narrowing the ray sampling interval to the proximity of the estimated surface, and sampling at the expectation maximum position, we preserve scene characteristics while learning common attributes for novel-view synthesis. Moreover, we introduce a Multi-level Semantic Consistency (MSC) loss to support more informative representation learning. Extensive experiments on indoor and outdoor datasets show that, compared with state-of-the-art generalizable NeRF methods, DARF reduces samples by 50% while improving rendering quality and depth estimation. Our code is available at https://github.com/shiyue001/GARF.git.
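
To make concrete how a depth value can come out of rendering without depth supervision, the sketch below computes an expected ray-termination depth from per-sample densities using the standard NeRF quadrature weights. This is the generic NeRF formulation, shown under the assumption that a similar weighting applies here; the abstract does not state DARF's exact depth formula, and the function name and toy densities are hypothetical.

```python
import math


def render_depth(ts, sigmas):
    """Expected termination depth along one ray (generic NeRF quadrature).

    ts     : increasing sample depths along the ray
    sigmas : non-negative densities predicted at those depths
    Returns (expected_depth, accumulated_weight).
    """
    transmittance = 1.0  # probability the ray has not been absorbed yet
    depth = 0.0
    weight_sum = 0.0
    for i, (t, sigma) in enumerate(zip(ts, sigmas)):
        # Distance to the next sample; the last sample reuses the previous gap.
        if i + 1 < len(ts):
            delta = ts[i + 1] - t
        else:
            delta = t - ts[i - 1] if i > 0 else 0.0
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        depth += weight * t
        weight_sum += weight
        transmittance *= 1.0 - alpha
    return depth, weight_sum


# Toy ray: ten samples inside a narrow interval, one nearly opaque sample.
ts = [3.25 + 0.05 * i for i in range(10)]
sigmas = [0.0] * 10
sigmas[5] = 1000.0  # hypothetical surface located at ts[5]

depth, opacity = render_depth(ts, sigmas)
print(depth, opacity)
```

Because the weights concentrate on the sample where density is high, the rendered depth lands at that sample's position, which is why denser sampling near the surface can also sharpen the inferred depth map.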

Paper Structure

This paper contains 13 sections, 11 equations, 9 figures, and 5 tables.

Figures (9)

  • Figure 1: Our framework. The proposed Depth-Aware Generalizable Neural Radiance Field (DARF) consists of three parts. First, deep convolutional features are extracted from the input images to form learnable features of the scene. Second, a depth-aware dynamic sampling strategy is derived from the depth prior estimated by a pre-trained depth estimation foundation model. Finally, a decoder module predicts color and density to jointly render novel-view images and infer a fine depth map.
  • Figure 2: Illustration of the depth-aware dynamic sampling. Compared with the sampling strategy in NeRF, our proposed DADS strategy tends to distribute more sample points near the surface.
  • Figure 3: Rendering quality comparison. For every scene, zoomed-in contents of the red and green boxes are shown in the upper and lower rows, respectively. The PixelNeRF renderings are blurry. IBRNet loses content in fine details and marginal areas. MVSNeRF produces unrealistic colors and blurred geometric edges. Our results show high rendering quality.
  • Figure 4: Depth maps derived from four methods on DTU. Our method achieves significantly more accurate depth than the others, illustrating the effectiveness of DADS. Our depth estimates show clear boundaries and improved surface continuity, which also explains our method's advantage in rendering novel views.
  • Figure 5: Boxplot of the user study. Our method shows better performance (high mean) and greater stability across test scenes (narrow interquartile range).
  • ...and 4 more figures