VoxNeRF: Bridging Voxel Representation and Neural Radiance Fields for Enhanced Indoor View Synthesis

Sen Wang; Qing Cheng; Stefano Gasperini; Wei Zhang; Shun-Cheng Wu; Niclas Zeller; Daniel Cremers; Nassir Navab

TL;DR

VoxNeRF addresses indoor sparse-view novel view synthesis by integrating geometry priors into NeRF via a Sparse Voxel Octree (SVO) and a voxel-guided sampling strategy. The method models geometry uncertainty with a Gaussian around ray-surface intersections, densifies sampling near surfaces, and optimizes with a robust depth loss plus depth-gradient regularization within a Multi-resolution Hash Grid radiance framework, yielding $L = L_c + \lambda_d( L_d + L_{reg})$. Empirical results on ScanNet and ScanNet++ show VoxNeRF achieving state-of-the-art fidelity and significantly faster training compared to prior methods, especially in extrapolation scenarios where view overlap is limited. This geometry-guided approach enhances robustness to occlusions and textureless regions, offering practical benefits for indoor robotics, though it relies on preprocessing steps to obtain the priors. Overall, VoxNeRF demonstrates how structured scene priors can dramatically improve both the quality and efficiency of indoor neural rendering, paving the way for real-time robotic perception and interaction.
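As a rough illustration of how the combined objective $L = L_c + \lambda_d( L_d + L_{reg})$ fits together, the sketch below assembles the three terms in NumPy. The summary does not give the exact form of each term, so everything here is an assumption: mean squared error for $L_c$, a Huber penalty as a stand-in for the robust depth loss $L_d$, and an L1 penalty on horizontal depth finite differences for $L_{reg}$. The function names, the `delta` threshold, and the default `lambda_d` are hypothetical.

```python
import numpy as np


def huber(residual, delta=0.1):
    """Huber penalty: quadratic for small residuals, linear for large ones."""
    abs_r = np.abs(residual)
    return np.where(abs_r <= delta,
                    0.5 * abs_r ** 2,
                    delta * (abs_r - 0.5 * delta))


def total_loss(rgb_pred, rgb_gt, depth_pred, depth_prior, lambda_d=0.1):
    """Combine L = L_c + lambda_d * (L_d + L_reg) for one image patch.

    rgb_* have shape (H, W, 3); depth_* have shape (H, W).
    """
    # L_c: photometric term (assumed: mean squared error).
    l_c = np.mean((rgb_pred - rgb_gt) ** 2)
    # L_d: robust depth term (assumed: Huber penalty against the prior depth).
    l_d = np.mean(huber(depth_pred - depth_prior))
    # L_reg: depth-gradient term (assumed: L1 gap between horizontal
    # finite differences of the predicted and prior depth).
    grad_pred = np.diff(depth_pred, axis=-1)
    grad_prior = np.diff(depth_prior, axis=-1)
    l_reg = np.mean(np.abs(grad_pred - grad_prior))
    return l_c + lambda_d * (l_d + l_reg)
```

With these stand-ins, a constant depth offset is penalized only through $L_d$, because a constant shift leaves the depth gradients unchanged.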

Abstract

The generation of high-fidelity view synthesis is essential for robotic navigation and interaction but remains challenging, particularly in indoor environments and real-time scenarios. Existing techniques often require significant computational resources for both training and rendering, and they frequently result in suboptimal 3D representations due to insufficient geometric structuring. To address these limitations, we introduce VoxNeRF, a novel approach that utilizes easy-to-obtain geometry priors to enhance both the quality and efficiency of neural indoor reconstruction and novel view synthesis. We propose an efficient voxel-guided sampling technique that allocates computational resources selectively to the most relevant segments of rays based on a voxel-encoded geometry prior, significantly reducing training and rendering time. Additionally, we incorporate a robust depth loss to improve reconstruction and rendering quality in sparse view settings. Our approach is validated with extensive experiments on ScanNet and ScanNet++ where VoxNeRF outperforms existing state-of-the-art methods and establishes a new benchmark for indoor immersive interpolation and extrapolation settings.
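One way to picture voxel-guided sampling is to concentrate each ray's sample depths around the depth at which the ray hits the voxel-encoded surface. The sketch below draws depths from a Gaussian centered on that hit depth. It is an illustration under stated assumptions, not the authors' implementation: the function name `sample_near_surface`, its parameters, and the choice of tying the standard deviation to the voxel diagonal $\sqrt{3} v$ are all hypothetical.

```python
import numpy as np


def sample_near_surface(z_hit, voxel_size, n_samples, near, far, rng):
    """Draw sorted sample depths per ray, concentrated around z_hit.

    z_hit: per-ray depth of the ray/voxel intersection, shape (N,).
    Returns an array of shape (N, n_samples).
    """
    z_hit = np.asarray(z_hit, dtype=float)
    # Assumed spread: one voxel diagonal (sqrt(3) * edge length).
    sigma = np.sqrt(3.0) * voxel_size
    z = rng.normal(loc=z_hit[:, None], scale=sigma,
                   size=(z_hit.shape[0], n_samples))
    # Keep samples inside the ray's valid depth range.
    z = np.clip(z, near, far)
    # Volume rendering expects depths ordered front to back.
    return np.sort(z, axis=-1)


rng = np.random.default_rng(0)
depths = sample_near_surface([2.0, 3.5], voxel_size=0.05,
                             n_samples=64, near=0.1, far=10.0, rng=rng)
```

Compared with uniform sampling over `[near, far]`, almost all of the 64 samples per ray land within a few voxel widths of the prior surface, which is where the radiance field needs the most evaluations.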

Paper Structure (20 sections, 12 equations, 4 figures, 4 tables)

Introduction
Related works
Novel View Synthesis
NeRF with Geometric Priors
Method
Preliminaries
Scene Geometry Prior Modeling
Efficient Sampling
Robust Geometry Regularization
Experiments and Results
Experimental Setup
Results
Results on ScanNet
Results on ScanNet++
Training Efficiency
...and 5 more sections

Figures (4)

Figure 1: By exploiting geometric priors, the proposed VoxNeRF generates better novel views and optimizes faster than both P2NeRF [sun2024global], which also uses a geometry prior, and ZipNeRF [barron2023zip], which uses no prior.
Figure 2: Pipeline of the proposed VoxNeRF. First, we extract the scene geometry from various sources and transform it into a Sparse Voxel Octree (SVO). Next, we perform ray casting and efficiently sample points based on a Gaussian distribution. Finally, we introduce a robust geometry regularization term to improve rendering in ambiguous and textureless areas. The models are optimized by minimizing both the photometric loss and the robust depth term relative to the ground truth and pseudo ground truth.
Figure 3: Rationale behind efficient sampling. The voxel is defined by sampling points estimated from the surface. Because the surface generation process introduces noise, the actual surface may, in the worst case, lie at the upper-rightmost vertex. In that case, the ray $r'$ intersects the cube at the farthest vertex, so the distance between the actual surface and the intersection point $z'(p_i')$ equals $\sqrt{3} v_i$.
Figure 4: From top to bottom, the ground truth and selected extrapolation results rendered by different methods on the ScanNet [dai2017scannet] and ScanNet++ [yeshwanthliu2023scannetpp] datasets.
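
The $\sqrt{3} v_i$ bound in the Figure 3 caption is the space diagonal of a cube with edge length $v_i$. As a quick check, assume an axis-aligned voxel and let $a$ and $b$ (symbols introduced here for illustration) be any two points inside it. Each coordinate then differs by at most $v_i$, so:

$$
\|a - b\|_2 = \sqrt{\sum_{k=1}^{3} (a_k - b_k)^2} \;\le\; \sqrt{3 v_i^2} = \sqrt{3}\, v_i
$$

Equality holds when $a$ and $b$ sit at opposite corners of the cube, which is the worst case the caption describes: the ray enters at one vertex while the true surface lies at the diagonally opposite one.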
